Abstract

The purpose of this paper is to study the asymptotic behavior of the Stochastic
EM algorithm (SEM) in a simple particular case within the mixture context.
We consider the estimation of the mixing proportion $p$ of a two-component
mixture of densities assumed to be known. We establish that the stationary
distribution of the ergodic Markov chain generated by SEM is asymptotic,
as the sample size $N$ tends to infinity, to a Gaussian distribution with mean
the consistent maximum likelihood estimate of $p$ and variance proportional to
$1/N$. Similarly, we determine the limiting distributions of two sequential
versions of SEM and study their asymptotic relative efficiency.
1  Introduction


The  purpose  of  the  present  article  is  to  study  the  asymptotic  behavior  of  the 
random  sequence  of  parameters  generated  by  the  Stochastic  EM  algorithm 
(SEM  algorithm,  see,  e.g.,  Celeux  and  Diebolt  (1985))  as  the  sample  size 
$N \to \infty$, in a simple particular case within the mixture context.

The EM algorithm (Dempster, Laird and Rubin, 1977) is a widely applicable
approach for computing maximum likelihood (ML) estimates for incomplete
data. Despite appealing features, the EM algorithm has several severe,
well-documented drawbacks.

In an attempt to overcome some of these drawbacks, we have defined and
studied a stochastic version of the EM algorithm, which we have called the
SEM algorithm, in Broniatowski, Celeux and Diebolt (1983) and Celeux and
Diebolt (1985, 1986a, 1987). Instead of maximizing the expected complete-data
loglikelihood conditional on the observations $x_{(N)}$, the
SEM algorithm first simulates the missing data $z_{(N)}$ from the conditional
density $k(z_{(N)} \mid x_{(N)}, \theta^{(m)})$, where $\theta^{(m)}$ is the current guess of the
parameter, and then computes the maximum of the pseudo-completed likelihood
function, thus producing the updated estimator $\theta^{(m+1)}$. Note that the SEM
algorithm can be seen as a particular case of the MCEM algorithm of Wei
and Tanner (1990), with $q = 1$ in their notation, and that these authors
overlooked Celeux and Diebolt's previous papers. (An answer to Wei and
Tanner (1990) can be found in Biscarat, Celeux and Diebolt (1992).)

The random sequence $\{\theta^{(m)}\}$ generated by SEM is a homogeneous Markov
chain which turns out to be ergodic in most of the cases of interest (see Section
2 for a proof of ergodicity in the particular mixture case under consideration).
Let $\Psi_N$ denote its stationary distribution, where the subscript $N$ indicates
dependence upon the observed sample $x_{(N)} = \{x_1, \ldots, x_N\}$. The estimator
of $\theta$ provided by SEM is the mean of the distribution $\Psi_N$. It can be
approximated by averaging over a sufficient number of $\theta^{(m)}$ after $\theta^{(m)}$ has
approximately reached its stationary regime (see Section 2).

Celeux and Diebolt (1985, 1986a) provide experimental evidence which
shows that, for reasonable sample sizes, SEM is often preferable to EM. It
avoids saddle points as well as nonsignificant local maxima of the likelihood
function and, in some cases, greatly accelerates the convergence. Moreover,
for mixtures, it allows misspecification of the number of components since
an upper bound of the number of components is sufficient to ensure
convergence to the actual number of components if the sample size is large enough.
Finally, in some particular cases, SEM may even provide a good alternative
when the E-step evaluation of the EM algorithm is too intricate. For instance,
when considering censored data it is much easier to simulate the censored
data than to work with the expected complete likelihood conditional upon
$x_{(N)}$ (see Wei and Tanner (1990) and Chauveau (1991)).

On the other hand, Diebolt and Robert (1992) have highlighted the links
between SEM and Bayesian sampling. SEM can be viewed as a simplified
version of the Data Augmentation algorithm of Tanner and Wong (1987)
with noninformative priors, where the step of simulation of the posterior
distribution conditional upon the pseudo-completed data, $\pi(\theta \mid x_{(N)}, z^{(m)})$, is
replaced by the computation of the mean of $\pi(\theta \mid x_{(N)}, z^{(m)})$. A similar parallel
can be exhibited for Gibbs sampling. The interest of the SEM alternative in
this perspective is that it allows for working out an estimate of $\theta$ even when
the distributions under consideration are not conjugate. For instance,
Chauveau (1991) makes use of SEM for this reason when dealing with mixtures
of Weibull distributions.

The numerical simulation results of Celeux and Diebolt show that the
stationary distribution $\Psi_N$ of SEM is usually concentrated around a
significant local maximum of the likelihood. In the present paper, we address the
following basic problems:

1. Is $\mathrm{Mean}(\Psi_N)$ a consistent estimator of $\theta$?

2. What is the order of $\mathrm{Mean}(\Psi_N) - \theta_N$, where $\theta_N$ is the unique consistent
solution of the likelihood equations (e.g., Redner and Walker, 1984)?

3. Is the conditional distribution of $\sqrt{N}(W_N - \theta_N)$ given the observed
sample $x_{(N)} = \{x_1, \ldots, x_N\}$ asymptotically normal
with mean 0 and positive variance matrix, where $W_N$ is a
random variable drawn from $\Psi_N$?

Since the theoretical results on the convergence of EM (Wu, 1983 and
Redner and Walker, 1984) are essentially of a local nature and the standard
asymptotic Bayesian theory cannot be used, Problems (1)-(3) appear rather
formidable. Indeed, we did not find how to treat them in the general case
with several local maxima as well as saddle points. This is the reason why
we focused on a particular case where the likelihood function (l.f.) is concave
(see Section 2). Of course, in such a case, EM gives good results, and from
a practical point of view, SEM is not useful. However, results obtained in
this particular case have their own theoretical interest and permit us to gain
insight into what is really happening and the mathematics behind Problems
(1)-(3). Moreover, these results suggest what answers can be expected in
more general situations.

In  Section  2,  we  present  the  simple  mixture  model  that  we  will  consider 
throughout  the  paper  and  derive  preliminary  results  about  the  l.f.,  EM  and 
SEM  in  this  particular  case. 

Section 3 is devoted to our main result, stated as Theorem 1. This
theorem gives affirmative answers to Problems (1) and (3) for the model
under consideration. It also provides a (non-optimal) estimate of the rate of
$\mathrm{Mean}(\Psi_N) - \theta_N$ (Problem (2)), as well as an estimate of the rate of the conditional
variance of $\theta^{(m)}$ given $x_{(N)}$. Furthermore, the results in Theorem 1 imply
that $\mathrm{Mean}(\Psi_N)$ is an asymptotically unbiased and optimal estimator of $\theta$ and the
stationary rescaled SEM sequence $\{\sqrt{N}(\theta^{(m)} - \theta_N)\}$ converges in
distribution, as $N \to \infty$, to the distribution of the stationary autoregressive
sequence $\{Z_*^{(m)}\}$ defined by

$$Z_*^{(m+1)} = r^* Z_*^{(m)} + \sigma^* \varepsilon^{(m)}, \qquad (1.1)$$

where the $\varepsilon^{(m)}$ are Gaussian i.i.d. random variables with mean 0 and
variance 1, $\varepsilon^{(m)}$ is independent of $Z_*^{(0)}, \ldots, Z_*^{(m)}$, and $r^*$, $0 < r^* < 1$, and $\sigma^*$, $\sigma^* > 0$,
are defined in (2.8) and (2.19) in terms of the complete, conditional and
observed Fisher information values, respectively.

Section 4 examines two different sequential versions of SEM. The
"one-step" version has been implicitly studied in Silverman (1980), but its
asymptotic efficiency can be equal to zero. Our Theorem 3 states the a.s.
convergence of the "global" version and its asymptotic normality. Since the
asymptotic variance can be made explicit, we can examine in detail its asymptotic
efficiency, which turns out to be of the same order as the optimal bound.
2  Preliminary  results


2.1  The  mixture  problem 

Throughout this paper, the observed data $x_{(N)} = \{x_1, \ldots, x_N\}$ will be
realizations of i.i.d. random variables from the mixture density $h(x, p^*)$, where

$$h(x, p) = p f_1(x) + (1 - p) f_2(x), \qquad (2.1)$$

where $f_1(x)$ and $f_2(x)$ are known densities with respect to a $\sigma$-finite measure
$\mu(dx)$ on some separable measurable space $E$, and the parameter $p$ satisfies
$0 < p < 1$. We will assume that $\mu\{x : f_1(x) \neq f_2(x)\} \neq 0$ and that $f_1(x)$
and $f_2(x)$ are positive on their respective supports. The statistical problem
under consideration is to find a good estimate of $p^*$ on the basis of $x_{(N)}$.
Before proceeding, a formal point has to be made, since the study of the
asymptotic behavior as the sample size $N \to \infty$ of a stochastic algorithm
involves two different probability spaces: the sample space and the space
of pseudorandom drawings. We will interpret each sample $x_{(N)}$ of size $N$
as the projection on the first $N$ coordinates of a sequence $x = \{x_i; i \geq 1\}$
drawn from the product space $X = E^{\mathbb{N}}$ endowed with the probability
distribution

$$P_x = \prod_{i=1}^{\infty} h(x_i, p^*)\, dx_i. \qquad (2.2)$$

The formal description of the pseudorandom drawings is postponed to
Subsection 2.3.

Next, let us describe the underlying complete data structure of the
statistical problem under consideration. The complete data is $(x_{(N)}, z_{(N)}) =
\{(x_i, z_i); i = 1, \ldots, N\}$, where $z_i = 1$ or $0$ according as $x_i$ has been drawn
from $f_1(x)$ or $f_2(x)$, and the $z_i$'s are independent. Thus, each $z_i$ is a Bernoulli
r.v. with parameter $t_i^* = t(x_i, p^*)$, where

$$t(x, p) = \frac{p f_1(x)}{p f_1(x) + (1 - p) f_2(x)}. \qquad (2.3)$$
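As a concrete illustration of (2.3), the following minimal sketch evaluates the membership probability $t(x, p)$ for one hypothetical pair of known components. The unit-variance Gaussian densities with means 0 and 2, and the helper names `normal_pdf`, `f1`, `f2` and `t`, are assumptions of this example and are not taken from the paper.

```python
import math


def normal_pdf(x, mean, sd=1.0):
    """Density of the N(mean, sd^2) distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Hypothetical known components (an assumption of this sketch).
def f1(x):
    return normal_pdf(x, 0.0)


def f2(x):
    return normal_pdf(x, 2.0)


def t(x, p):
    """Probability (2.3) that x was drawn from f1 when the mixing proportion is p."""
    numerator = p * f1(x)
    return numerator / (numerator + (1.0 - p) * f2(x))


# Halfway between the two means both components are equally likely when p = 1/2.
midpoint_value = t(1.0, 0.5)
```

With these components, an observation far to the left is attributed to $f_1$ with probability close to 1, and $t(x, 0) = 0$, $t(x, 1) = 1$ for every $x$.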


2.2  The  EM  algorithm 

The $m$th iteration $p^{(m+1)} = T_N(p^{(m)})$ of EM consists in the E-step: Compute
$t_i^{(m)} = t(x_i, p^{(m)})$ for $i = 1, \ldots, N$, followed by the M-step: Compute $p^{(m+1)} = T_N(p^{(m)})$,
where

$$T_N(p) = \frac{1}{N} \sum_{i=1}^{N} t(x_i, p) \quad \text{for } p \in (0, 1). \qquad (2.4)$$

Thus, letting $T_N(0) = 0$ and $T_N(1) = 1$, the EM algorithm indeed consists
in iterating the function $T_N : [0, 1] \to [0, 1]$, starting from an initial position
$p^{(0)} \in (0, 1)$. We have the following preliminary results, where

$$L_N(p) = \sum_{i=1}^{N} \log h(x_i, p) \qquad (2.5)$$

denotes the loglikelihood function and the observed, complete and conditional
Fisher information values $J_{obs}$, $J_c$ and $J_{cond}$, respectively, are defined as in
Titterington, Smith and Makov (1985).

Lemma 2.1 (i) The function $T_N(p)$ is increasing over $[0, 1]$.

(ii) We have, for all $p$ in $[0, 1]$,

$$T_N(p) - p = p(1 - p) \frac{L_N'(p)}{N}. \qquad (2.6)$$

(iii) For $P_x$-almost every $x \in X$, there exists an integer $N_0 = N_0(x)$ such
that $L_N''(p) < 0$ for all $p$ in $(0, 1)$ whenever $N > N_0$, i.e. the
loglikelihood is a concave function for $N > N_0$.

(iv) If $p_N$ is the unique maximizer of $L_N(p)$ when $N > N_0$, $p_N$ is the unique
stable fixed point of $T_N(p)$ over $[0, 1]$, with

$$r_N = T_N'(p_N) = 1 + p_N(1 - p_N) \frac{L_N''(p_N)}{N} \in (0, 1), \qquad (2.7)$$

and $T_N(p) > p$ for $0 < p < p_N$ and $T_N(p) < p$ for $p_N < p < 1$.

(v) Each sequence $\{p^{(m)}\}$ generated by EM starting from $p^{(0)} \in (0, 1)$
converges to $p_N$ with a geometric rate.

(vi) The derivative $r_N$ of $T_N$ at $p_N$ converges $P_x$-a.s. to

$$r^* = 1 - \frac{J_{obs}}{J_c} = \frac{J_{cond}}{J_c} \in (0, 1) \qquad (2.8)$$

as $N \to \infty$.
A brief proof of Lemma 2.1 can be found in the Appendix.


Remark 2.1 - Lemma 2.1 shows that, in the simple incomplete data
statistical problem under consideration, EM does very well if $N > N_0$. Thus, from
a practical point of view, there is no need for any improved algorithm in this
particular case. However, we have explained in Section 1 why the asymptotic
behavior of SEM as $N \to \infty$ in this context deserves a careful study.
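The EM iteration of Subsection 2.2 can be sketched numerically as follows. This is only an illustration under stated assumptions: the two known components are taken to be unit-variance Gaussians with means 0 and 2, the sample is simulated with $p^* = 0.3$ and $N = 500$, and the names `em_operator`, `score` and `run_em` are hypothetical helpers, not notation from the paper.

```python
import math
import random


def normal_pdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def f1(x):  # hypothetical first component, N(0, 1)
    return normal_pdf(x, 0.0)


def f2(x):  # hypothetical second component, N(2, 1)
    return normal_pdf(x, 2.0)


def em_operator(p, xs):
    """EM map T_N(p): the average of the membership probabilities t(x_i, p)."""
    total = 0.0
    for x in xs:
        a = p * f1(x)
        total += a / (a + (1.0 - p) * f2(x))
    return total / len(xs)


def score(p, xs):
    """Derivative of the loglikelihood: sum over i of (f1 - f2)(x_i) / h(x_i, p)."""
    return sum((f1(x) - f2(x)) / (p * f1(x) + (1.0 - p) * f2(x)) for x in xs)


def run_em(xs, p0=0.5, tol=1e-12, max_iter=10_000):
    """Iterate p <- T_N(p) until two successive iterates agree up to tol."""
    p = p0
    for _ in range(max_iter):
        p_next = em_operator(p, xs)
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    return p


rng = random.Random(0)
p_star = 0.3  # simulated "true" proportion, an assumption of this sketch
xs = [rng.gauss(0.0, 1.0) if rng.random() < p_star else rng.gauss(2.0, 1.0)
      for _ in range(500)]
p_hat = run_em(xs)
```

In this setting the limit `p_hat` plays the role of $p_N$: it is a fixed point of `em_operator`, and the identity (2.6) can be checked at any $p$ by comparing `em_operator(p, xs) - p` with `p * (1 - p) * score(p, xs) / len(xs)`.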

2.3  The  SEM  algorithm 

In the present context, the Stochastic Imputation Principle (e.g., Celeux
and Diebolt, 1987) produces the updated estimate $p^{(m+1)}$ as the ML
estimate based on the pseudo-completed sample $(x_{(N)}, z^{(m)}) = \{(x_i, z_i^{(m)}); i =
1, \ldots, N\}$, where $z^{(m)} \in Z_{(N)} = \{0, 1\}^N$, each $z_i^{(m)}$ is a Bernoulli r.v. with
parameter $t_i^{(m)} = t(x_i, p^{(m)})$ given by (2.3) and the $z_i^{(m)}$, $i = 1, \ldots, N$, are
drawn independently. This yields, in view of (2.4),

$$p^{(m+1)} = \frac{1}{N} \sum_{i=1}^{N} z_i^{(m)}, \qquad (2.9)$$

so that the random sequence $\{p^{(m)}\}$ is a homogeneous Markov chain taking its
values in $\{0, 1/N, \ldots, 1\}$. In order to remove the absorbing states 0 and 1,
we first choose a sequence of thresholds $c(N)$, $1/N < c(N) < 1 - c(N) < 1 - 1/N$,
such that $c(N) \to 0$ as $N \to \infty$, and a probability distribution $\Gamma_N$
on the set $\{j/N : j = 0, 1, \ldots, N$ and $c(N) \leq j/N \leq 1 - c(N)\}$. The SEM
algorithm then proceeds as follows.

E-step: Compute $t(x_i, p^{(m)}) = t_i^{(m)}$ for $i = 1, \ldots, N$ using (2.3).

S-step: For $i = 1, \ldots, N$, draw independently the Bernoulli r.v.'s $z_i^{(m)}$
with parameter $t_i^{(m)}$ and compute

$$p^{(m+1/2)} = \frac{1}{N} \sum_{i=1}^{N} z_i^{(m)}. \qquad (2.10)$$

If $p^{(m+1/2)} \in J_N = [c(N), 1 - c(N)]$, then go to M-step. Otherwise, draw
$p^{(m+1)}$ from the preassigned distribution $\Gamma_N$ and go to E-step.

M-step:

$$p^{(m+1)} = p^{(m+1/2)}. \qquad (2.11)$$

This procedure avoids being stuck at $p = 0$ or $p = 1$, whereas the
sequence $\{p^{(m)}\}$ defined in this way is still a homogeneous Markov chain. Next, we
turn to the ergodicity of this Markov chain.
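The E-, S- and M-steps above can be sketched as a short simulation. Everything specific here is an assumption made for illustration: Gaussian components with means 0 and 2, a sample of size $N = 400$ simulated with $p^* = 0.3$, the threshold $c(N) = 10/N$, a restart distribution concentrated at the single point $(N/2)/N$, and a burn-in of 100 iterations before averaging.

```python
import math
import random


def normal_pdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def posterior(x, p):
    """Membership probability t(x, p) for hypothetical N(0,1) and N(2,1) components."""
    a = p * normal_pdf(x, 0.0)
    return a / (a + (1.0 - p) * normal_pdf(x, 2.0))


def sem_step(p, xs, c, restart, rng):
    """One SEM iteration: E-step, S-step, then M-step or restart."""
    n = len(xs)
    # E-step and S-step: draw each z_i as Bernoulli(t(x_i, p)) and average.
    count = sum(1 for x in xs if rng.random() <= posterior(x, p))
    p_half = count / n
    if c <= p_half <= 1.0 - c:
        return p_half  # M-step: accept the simulated proportion
    return restart     # outside J_N: restart from the preassigned point


def run_sem(xs, p0, c, restart, n_iter, rng):
    chain = [p0]
    for _ in range(n_iter):
        chain.append(sem_step(chain[-1], xs, c, restart, rng))
    return chain


rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) if rng.random() < 0.3 else rng.gauss(2.0, 1.0)
      for _ in range(400)]
n = len(xs)
c = 10.0 / n            # hypothetical threshold c(N)
restart = (n // 2) / n  # hypothetical restart point j(N)/N with j(N) = N/2
chain = run_sem(xs, 0.5, c, restart, 600, rng)
burn_in = 100
p_sem = sum(chain[burn_in:]) / len(chain[burn_in:])
```

The average `p_sem` approximates the mean of the stationary distribution by an ergodic average, in the spirit of the estimator described in Section 1. The burn-in length and the number of iterations are arbitrary choices here and would need to be assessed in practice.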

Lemma 2.2 The homogeneous Markov chain $\{p^{(m)}\}$ generated by SEM is
geometrically ergodic and the support of its stationary distribution $\Psi_N$ is
contained in $J_N$.


The proof of Lemma 2.2 can be found in the Appendix.

Remark 2.2 - As a consequence, since $\{p^{(m)}\}$ is a finite-state ergodic Markov
chain, it is uniformly strongly mixing with a geometric rate. Hence, the SLLN
and a suitable version of the CLT (e.g., Davydov, 1973) apply.

We conclude this section by showing that the sequence generated by SEM
can be viewed as a random perturbation of the discrete-time dynamical
system on $[0, 1]$ generated by EM. First, we need to have a workable
representation of the r.v.'s $z_i^{(m)}$. They can be written as

$$z_i^{(m)} = 1_{[0,\, t(x_i, p^{(m)})]}(\omega_i^{(m)}), \qquad (2.12)$$

where $1_{[a,b]}(s)$ is the indicator function of the interval $[a, b]$ and the $\omega_i^{(m)}$'s,
$i = 1, \ldots, N$, are i.i.d. random variables uniformly distributed on $[0, 1]$ such
that the sample $\omega^{(m)} = (\omega_1^{(m)}, \ldots, \omega_N^{(m)})$ is independent of $p^{(0)}, \ldots, p^{(m)}$. We
have

$$p^{(m+1/2)} = T_N(p^{(m)}) + U_N(p^{(m)}, \omega^{(m)}), \qquad (2.13)$$

where for each $p$, $0 < p < 1$,

$$U_N(p, \omega) = N^{-1/2} S_N(p)\, \eta_N(p, \omega), \qquad (2.14)$$

where $S_N(p) > 0$, $S_N^2(p) = p(1 - p) T_N'(p)$ converges $P_x$-a.s. as $N \to \infty$ to

$$S^2(p) = p(1 - p) \int f_1(x) f_2(x) \frac{h(x, p^*)}{h^2(x, p)}\, \mu(dx), \qquad (2.15)$$

and the r.v. $\omega \mapsto \eta_N(p, \omega)$ defined by

$$\eta_N(p, \omega) = N^{-1/2} S_N^{-1}(p) \sum_{i=1}^{N} \{1_{[0,\, t(x_i, p)]}(\omega_i) - t(x_i, p)\}, \qquad (2.16)$$

has mean 0 and variance 1 for each $p$ in $(0, 1)$. With the above representation
of the $z_i^{(m)}$'s, the probabilistic setup of the successive random drawings
involved in the S-step of SEM can be made precise: each $\omega^{(m)}$ can be viewed as
a whole sequence of $\Omega = [0, 1]^{\mathbb{N}}$ endowed with the product $\sigma$-field and the
probability $P_\Omega(d\omega) = \prod_{i \geq 1} \lambda(d\omega_i)$, where $\lambda$ denotes the Lebesgue measure on
$[0, 1]$, whereas $p^{(m+1/2)}$ only involves the first $N$ coordinates $\omega_1^{(m)}, \ldots, \omega_N^{(m)}$
of the sequence $\omega^{(m)}$. We will denote by $E_\Omega(Y)$ the expectation of any r.v.
$Y(\omega)$ with respect to this probability $P_\Omega$. For instance, $E_\Omega\{\eta_N(p, \omega)\} = 0$
and $E_\Omega\{\eta_N^2(p, \omega)\} = 1$ for all $p$ in $(0, 1)$.
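The representation (2.12)-(2.16) lends itself to a quick numerical sanity check. The sketch below assumes hypothetical Gaussian components with means 0 and 2, a simulated sample of size 200 and the fixed value $p = 0.4$. It first compares the two expressions of $S_N^2(p)$, namely the average of $t(x_i, p)\{1 - t(x_i, p)\}$ and $p(1 - p) T_N'(p)$, and then estimates the mean and variance of $\eta_N(p, \omega)$ by Monte Carlo using uniform drawings $\omega_i$.

```python
import math
import random


def normal_pdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def posterior(x, p):
    """t(x, p) for hypothetical N(0,1) and N(2,1) components."""
    a = p * normal_pdf(x, 0.0)
    return a / (a + (1.0 - p) * normal_pdf(x, 2.0))


def em_derivative(p, xs):
    """T_N'(p): the average of f1 * f2 / h^2 over the sample."""
    total = 0.0
    for x in xs:
        g1, g2 = normal_pdf(x, 0.0), normal_pdf(x, 2.0)
        h = p * g1 + (1.0 - p) * g2
        total += g1 * g2 / (h * h)
    return total / len(xs)


def eta(ts, s_n, rng):
    """One draw of eta_N built from indicators of uniform variables, as in (2.16)."""
    total = 0.0
    for t_i in ts:
        z_i = 1.0 if rng.random() <= t_i else 0.0
        total += z_i - t_i
    return total / (math.sqrt(len(ts)) * s_n)


rng = random.Random(2)
xs = [rng.gauss(0.0, 1.0) if rng.random() < 0.3 else rng.gauss(2.0, 1.0)
      for _ in range(200)]
p = 0.4
ts = [posterior(x, p) for x in xs]

# S_N^2(p) computed as the average Bernoulli variance t(1 - t).
s_n_sq = sum(t_i * (1.0 - t_i) for t_i in ts) / len(ts)

draws = [eta(ts, math.sqrt(s_n_sq), rng) for _ in range(2000)]
mean_eta = sum(draws) / len(draws)
var_eta = sum((d - mean_eta) ** 2 for d in draws) / len(draws)
```

The first identity is exact up to rounding, since $p(1 - p) f_1 f_2 / h^2 = t(1 - t)$ pointwise. The Monte Carlo estimates `mean_eta` and `var_eta` should be close to 0 and 1, up to simulation noise.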

Since the CLT implies that, for all $p$ in $(0, 1)$ and $P_x$-a.e. $x$, $\eta_N(p, \omega)$
converges in $P_\Omega$-distribution as $N \to \infty$ to a Gaussian r.v. $\varepsilon(\omega)$ with mean
0 and variance 1, we can expect that, for large $N$, (2.10)-(2.14) can be
approximated by

$$p^{(m+1/2)} \approx T_N(p^{(m)}) + N^{-1/2} S_N(p^{(m)})\, \varepsilon^{(m)}, \qquad (2.17)$$

with $\varepsilon^{(m)} = \varepsilon(\omega^{(m)})$, so that, if we can show that the stationary measure
$\Psi_N$ of $\{p^{(m)}\}$ is well concentrated around $p_N$, then (2.17) turns out to be
approximately

$$p^{(m+1/2)} \approx p_N + r_N (p^{(m)} - p_N) + N^{-1/2} S_N(p^{(m)})\, \varepsilon^{(m)}. \qquad (2.18)$$

Furthermore, since $r_N \to r^*$ and

$$\sigma_N^2 = S_N^2(p_N) \to \sigma^{*2} = S^2(p^*) = p^*(1 - p^*) r^* = \frac{J_{cond}}{J_c^2} \qquad (2.19)$$

as $N \to \infty$,

$$p^{(m+1/2)} - p_N \approx r^* (p^{(m)} - p_N) + N^{-1/2} \sigma^* \varepsilon^{(m)}, \qquad (2.20)$$

so that, in general, $p^{(m+1)} = p^{(m+1/2)}$ if $p^{(m)}$ remains near $p_N$ most of the time.

If we can make the approximations (2.17)-(2.20) precise and uniform with
respect to $p^{(m)}$ in a suitable sense and show that $p^{(m)}$ remains near $p_N$, then
we will have essentially proved Theorem 1, which we will now state and prove.
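The linear recursion suggested by (2.20) has an explicit stationary variance. For a first-order autoregression $Z^{(m+1)} = r Z^{(m)} + \sigma \varepsilon^{(m)}$ with $0 < r < 1$ and $\varepsilon^{(m)}$ independent of $Z^{(m)}$, the variances satisfy $v_{m+1} = r^2 v_m + \sigma^2$, whose fixed point is $\sigma^2 / (1 - r^2)$. The short sketch below checks this numerically. The function names and the numerical values of `r` and `sigma` are arbitrary choices for illustration.

```python
def stationary_variance(r, sigma):
    """Fixed point of v -> r^2 v + sigma^2, valid for |r| < 1."""
    return sigma ** 2 / (1.0 - r ** 2)


def iterate_variance(r, sigma, v0=0.0, n_iter=200):
    """Propagate the variance of Z^(m) through the autoregressive recursion."""
    v = v0
    for _ in range(n_iter):
        v = r * r * v + sigma * sigma
    return v


r, sigma = 0.5, 0.3  # arbitrary illustrative values
v_limit = iterate_variance(r, sigma)
v_closed_form = stationary_variance(r, sigma)
```

Since the map $v \mapsto r^2 v + \sigma^2$ is a contraction with factor $r^2$, the iterates converge geometrically to the closed form, whatever the starting variance.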
3  Main  Results

3.1  Theorem  1 

Before stating our Theorem 1, we need to introduce

$$R(N) = R(x, N) = \sup_{p \in J_N,\, p \neq p_N} \frac{T_N(p) - p_N}{p - p_N}. \qquad (3.1)$$

It follows from the results in Subsection 2.2 and the Appendix that $0 <
R(N) < 1$ $P_x$-a.s. for $N$ large enough, and $1 - R(N) \to 0$ as $c(N) \to 0$.
Furthermore, the rate of convergence to 0 of $1 - R(N)$ can be arbitrarily slow
provided that the rate of $c(N)$ has been chosen slow enough.

Theorem 1 Suppose that the following assumptions (H1)-(H4) hold.

(H1) The densities $f_1(x)$ and $f_2(x)$ satisfy $\mu\{x : f_1(x) \neq f_2(x)\} \neq 0$.

(H2) The probability distribution $\Gamma_N$ (see Subsection 2.3) used to draw the
updated SEM estimator $p^{(m+1)}$ when $p^{(m+1/2)}$ is not in $J_N$ is the Dirac
measure at some $j(N)/N \in (c(N), 1 - c(N))$, $1 \leq j(N) \leq N - 1$.

(H3) $N c(N) \to \infty$ as $N \to \infty$.

(H4)  N{1  -  R(N)y  -^oo  asN 
Then 

(i) If $W_N$ is a r.v. from the stationary distribution $\Psi_N$ of SEM, then
$N^{1/2}(W_N - p_N)$ converges in distribution as $N \to \infty$ to a Gaussian r.v.
with mean 0 and variance $v^* = \sigma^{*2} / (1 - r^{*2})$, where $r^* = J_{cond}/J_c \in
(0, 1)$ and $\sigma^{*2} = p^*(1 - p^*) r^* = J_{cond}/J_c^2$.

(ii)  For  all  N  large  enough, 

\pT  -  Tn\  <  N-'IMN)  +  O  ,  (3.2) 

where  p’j^^  =  Afean(^A?)  ond 

a{N)  =  0{a(N) 6{N)),  (3.3) 

a{N)  =  O  (iV-'/>{l  -  /i(W))-')+0 

(3.4) 
and

$$\delta(N) = O(|r_N - r^*| + |\sigma_N - \sigma^*|). \qquad (3.5)$$

Furthermore,  if  a(A^){l  —  R{N)}~^  0  as  N  —*  oo,  then 

lV'ar(*^,)- ^1  =  0(1).  (3.6) 

(Hi)  If  PN{t),  — oo  <  t  <  oo,  denotes  the  d.f.  of  and  denotes  the 
normal  d.f.  with  mean  and  variance  vj^jIN  = 
then,  for  all  tq  such  that  0  <  p’  —  Tq  <  p*  +  tq  <  1 , 

sup  |P„(()-4.n(()I  =  o(5-^')  (3.7) 

ti\}-'—ro,p'+To]  \  ™  / 

sup  |Pa,(<)  -  $A/(OI  =  O  .  (3.8) 

t€[p*-To,p*+To] 


Remark 3.1 The assumption (H2) can be greatly relaxed to allow for more
general distributions $\Gamma_N$. It suffices to make proper choices of $\tilde{S}_N(p)$ and
$\xi_N(p, \omega)$ for $p \notin J_N$, but for these more general $\Gamma_N$'s the proof of Theorem 1
is more involved. Here, we have stated Theorem 1 under (H2) for clarity.

Remark 3.2 The assertion (i) tells us that the stationary distribution $\Psi_N$ of
SEM in the particular mixture context detailed in Section 2 is asymptotically
normal with mean $p_N$ and variance $v^*/N$, where $v^* = J_{obs}^{-1}\{1 + (J_c/J_{cond})\}^{-1} <
1/(2 J_{obs})$. Thus, the variance of a sample $\{p^{(m)}\}$ of the
stationary SEM sequence is only a fraction of the variance $1/J_{obs}$ of the ML
estimator $p_N$. This is natural since, as explained in Section 1, SEM can be
roughly viewed as a particular version of the Gibbs sampler, where the step
of simulation of $\theta^{(m+1)} \sim \pi(\theta \mid x_{(N)}, z^{(m)})$ is replaced by the updating $\theta^{(m+1)} =$
Mean of $\pi(\theta \mid x_{(N)}, z^{(m)})$, which reduces the variance of the generated sequence.
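The relations between $r^*$, $\sigma^{*2}$, $v^*$ and the Fisher information values can be checked numerically on an example. The sketch below is an illustration under explicit assumptions: Gaussian components with means 0 and 2, $p^* = 0.3$, the complete-data information taken as $J_c = 1/\{p(1 - p)\}$ (Bernoulli labels), the observed information as $J_{obs} = \int (f_1 - f_2)^2 / h \, dx$, and $J_{cond} = J_c - J_{obs}$. These are the usual definitions for this model, but they are written here from first principles rather than quoted from Titterington, Smith and Makov (1985).

```python
import math


def normal_pdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def f1(x):  # hypothetical component N(0, 1)
    return normal_pdf(x, 0.0)


def f2(x):  # hypothetical component N(2, 1)
    return normal_pdf(x, 2.0)


def mix(x, p):
    return p * f1(x) + (1.0 - p) * f2(x)


def integrate(g, lo=-12.0, hi=14.0, steps=20000):
    """Composite trapezoid rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (g(lo) + g(hi))
    for k in range(1, steps):
        total += g(lo + k * h)
    return total * h


p = 0.3
j_c = 1.0 / (p * (1.0 - p))                                  # complete information
j_obs = integrate(lambda x: (f1(x) - f2(x)) ** 2 / mix(x, p))  # observed information
j_cond = j_c - j_obs                                         # conditional information

r_star = j_cond / j_c
sigma_sq = p * (1.0 - p) * r_star
v_star = sigma_sq / (1.0 - r_star ** 2)

# Direct evaluation of the limit of T_N'(p_N), the integral of f1 * f2 / h.
r_direct = integrate(lambda x: f1(x) * f2(x) / mix(x, p))
```

In this example `r_direct` agrees with `r_star`, and `v_star` coincides with $J_{obs}^{-1}\{1 + J_c/J_{cond}\}^{-1}$, which is smaller than $1/(2 J_{obs})$ because $J_c > J_{cond}$.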

Remark 3.3 The assertion (ii) entails that the SEM estimator $\mathrm{Mean}(\Psi_N)$
is asymptotically unbiased and its sample variance is equal to the sample
variance of the ML estimator $p_N$ up to a term of the order of $a(N)/N =
o(1/N)$. Thus, $\mathrm{Mean}(\Psi_N)$ is asymptotically optimal.
Remark 3.4 From Lemma A.4 in the Appendix, it results that $\delta(N) =
O(\ell(N)/\sqrt{N})$ $P_x$-a.s., where $\ell(N) = (\log_2 N)^{1/2}$ and $\log_2$ denotes the iterated
logarithm.

Remark 3.5 Theorem 1 (i) can be directly generalized to the case where the
mixture $h(x, p) = \sum_{1 \leq k \leq K} p_k f_k(x)$ has $K \geq 3$ components and the parameter
to be estimated is $p = (p_1, \ldots, p_{K-1})$. In contrast, the assertions (ii) and (iii)
do not seem to be easily extendable via our method of proof.

Remark 3.6 Theorem 1 suggests that similar results hold in the context of
general mixture problems, and even in the general missing or incomplete data
context under reasonable assumptions. The only general result in this
perspective is a theorem in Celeux and Diebolt (1986b) which states essentially
that the assertion (i) holds under the restrictive assumption that the EM
operator $T_N(\theta)$ has only one fixed point in the compact $G_N$ corresponding to
the interval $J_N$, and that this unique fixed point is stable. This theorem
supports the conjecture that (i) holds in a rather general context, since, although
$T_N(\theta)$ has many fixed points whenever $G_N$ is reasonably large, the unique
consistent estimator becomes prominent, whereas the other fixed points of
$T_N(\theta)$ fluctuate and fade away as $N \to \infty$. In a very loose sense, this means
that $T_N(\theta)$ has asymptotically a unique fixed point in $G_N$, which turns out to
be stable since it is a maximum of the l.f.

Remark 3.7 Note that an alternative to the SEM algorithm is the SAEM
algorithm (Celeux and Diebolt, 1992), which is somewhat in the spirit of
simulated annealing. Celeux and Diebolt (1992) show that, for any given sample
$x_{(N)}$ with $N$ large enough, SAEM converges a.s. to a local maximum of the l.f.
in the context of general mixtures of densities from some exponential family,
under reasonable assumptions concerning the fixed points of $T_N(\theta)$ in $G_N$.
Furthermore, Biscarat (1992) establishes a more general result, which
covers other important incomplete data settings. See also Biscarat,
Celeux and Diebolt (1992) for a similar theoretical study of a simulated
annealing type version of the MCEM algorithm introduced in Wei and Tanner
(1990).
3.2  Proof  of  Theorem  1


We begin with a brief outline of the proof of Theorem 1. In Part I, we
establish results analogous to (i)-(iii) for the auxiliary sequence

$$V^{(m)} = \sqrt{N}(q^{(m)} - p_N), \qquad (3.9)$$

where the homogeneous Markov chain $\{q^{(m)}\}$ is recursively defined by

$$q^{(m+1)} = \tilde{T}_N(q^{(m)}) + N^{-1/2} \tilde{S}_N(q^{(m)})\, \xi_N(q^{(m)}, \omega^{(m)}), \qquad (3.10)$$

where $\tilde{T}_N(p) = T_N(p)$ for $p \in J_N$ and $\tilde{T}_N(p) = j(N)/N$, some point in $J_N$, for $p \notin J_N$,
$\tilde{S}_N(p) = S_N(p)$ for $p \in J_N$ and $\tilde{S}_N(p) = 0$ for $p \notin J_N$, and $\xi_N(p, \omega) = \eta_N(p, \omega)$
for $p \in J_N$, $\xi_N(p, \omega) = \eta_N(c(N), \omega)$ for $0 \leq p < c(N)$ and $\xi_N(p, \omega) =
\eta_N(1 - c(N), \omega)$ for $1 - c(N) < p \leq 1$. We first show that $\{q^{(m)}\}$ is ergodic
(Step 1) with stationary distribution denoted by $\Lambda_N$. Then, we derive an
upper bound for the moments $E_\Omega\{|V^{(m)}|^p\}$, introducing

$$0 < R(N) = \sup_{p \in J_N,\, p \neq p_N} \frac{T_N(p) - p_N}{p - p_N} < 1, \qquad (3.11)$$

in Step 2. In Step 3, we deduce from technical results about $T_N(p)$ and $S_N(p)$
(Lemmas 3.3 and 3.4) and from the Skorohod representation together with
bounds related to the Berry-Esseen Inequality (see Lemma 3.5) an upper
bound for $E_\Omega\{|V^{(m)} - Z^{(m)}|^2\}$, where $Z^{(0)} = V^{(0)}$ and

$$Z^{(m+1)} = r_N Z^{(m)} + \sigma_N \varepsilon^{(m)}, \qquad (3.12)$$

with $\varepsilon^{(m)}$ a Gaussian r.v. with mean 0 and variance 1, independent from
$Z^{(0)}, \ldots, Z^{(m)}$. In Step 4, we deduce from Lemma 3.5 results for $V^{(m)}$
analogous to (i)-(iii) and an upper bound for $\Lambda_N(J_N^c)$, where $J_N^c$ denotes the
complement of $J_N$ in $[0, 1]$.

In Part II, we show how to obtain from these results corresponding upper
bounds for the moments of $Y^{(m)}$, where

$$Y^{(m)} = \sqrt{N}(p^{(m)} - p_N). \qquad (3.13)$$
PART  I


Step 1 First, we have to make sure that $\{q^{(m)}\}$ is ergodic. This is the
purpose of Lemma 3.1 below.

Lemma 3.1 The homogeneous Markov chain $\{q^{(m)}\}$ defined by (3.10) is
ergodic. Moreover, its stationary distribution $\Lambda_N$ has all its moments finite
if $N c(N) \to \infty$ as $N \to \infty$.

The proof of Lemma 3.1 parallels that of Lemma 2.2 above, and can be
found in the Appendix.

Step  2  We  need  the  following  upper  bound  for  Efi  ^  ^  involving  R{N) 

defined  by  (3.11). 

Lemma  3.2  Assume  that  Nc{N)  —*  cx)  as  N  oo,  and  either  =  0 
or  is  in  its  stationary  regime.  Then,  for  Px  -  a.e.  x,  there  exists  a 

finite  integer  Ni{x)  such  that  N  >  Ni{x)  implies 

where  ||V||p  =  forp>l. 

Proof.  From  (3.9)-(3.10)  and  the  Minkowski  Inequality,  it  follows  that 

<  m)  1|k<”'||^ + i  ||{a,  ,  (3.15) 

since  0  <  Ss{p)  <1/4  for  all  p  in  (0, 1).  Now,  the  same  calculation  as  in  the 
proof  of  Lemma  3.1  (Appendix)  shows  that,  for  all  p  in  [0, 1], 

(3.16) 

<  1  q.  - \ - 

-  ^  2NciN)r'i^' 

where  =  inf  |T)(^(p)|  r'  >  0  for  almost  every  x  as  N  oo  (Appendix). 
Since  Nc{N)  —*  oo  as  N  —*  oo,  the  result  follows.  □ 
Step 3 Before establishing Lemma 3.5, which is the core of the proof, we
need two additional technical results, namely Lemmas 3.3 and 3.4 below,
whose proofs are postponed to the Appendix for clarity.

Lemma  3.3  (i)  ForPx  ~  a-e.  x,  there  exists  a  finite  integer  A^2(x)  >  A^i(x) 
such  that  N  >  A^2(x)  implies 

|rj;(p)l  <  -77^  for  all  p  in  (0,1).  (3.17) 

P(1  -p) 

(ii)  Let  0  <  £o  <  p"  be  given.  Then,  for  Px  -  a.e.  x,  there  exists  a  finite 
integer  N3{x)  >  N2{x.)  such  that  N  >  lV3(x)  implies  that  |p)v  —  p"!  <  £o 
[p*  —  £0)  P*  +  £o]  Is  contained  in  Jn  and,  for  all  h  such  that  pn  +  h  ^  [0, 1], 
we  have 

\Tn{pn  +  h)  —  pn  —  hr^l  <  (3.18) 

for  some  positive  constant  Aq. 

Lemma 3.4 Under the same assumptions as in Lemma 3.3, there exists a
positive constant $B_0$ and, for $P_x$-a.e. $x$, a finite integer $N_4(x) \geq N_3(x)$
such that $N \geq N_4(x)$ implies

$$|S_N(p_N + h) - S_N(p_N)| \leq B_0 |h| \qquad (3.19)$$

for all $h$ such that $p_N + h \in [0, 1]$.

The next lemma is the core of the proof of Theorem 1. Using Skorohod's
(1956) representation argument and the Berry-Esseen Inequality, it provides
a basic upper bound for $E_\lambda(|\xi_N(p, u) - \varepsilon(u)|^2)$ uniformly in $p \in [0, 1]$, where
$\xi_N(p, u)$ and $\varepsilon(u)$ denote the Skorohod representations of $\xi_N(p, \omega)$ and $\varepsilon(\omega)$,
respectively, as defined below.

Lemma 3.5 Assume that $N c(N) \to \infty$ as $N \to \infty$. Then, there exists
a probability space $(U = \{u = (u_1, u_2, \ldots)\}, P_U)$ and r.v.'s $\xi_N(p, u_i)$ and
$\varepsilon(u_i)$, $i = 1, 2, \ldots$, defined on this probability space, such that:

(i) $\xi_N(p, u_i)$ and $\varepsilon(u_i)$ have the same distributions as $\xi_N(p, \omega)$ and $\varepsilon(\omega)$,
respectively, for each $i = 1, 2, \ldots$ and all $p$ in $[0, 1]$.

(ii) For each fixed $N$ and $p$, the r.v.'s $\xi_N(p, u_i)$, $i = 1, 2, \ldots$, are i.i.d. and the
r.v.'s $\varepsilon(u_i)$, $i = 1, 2, \ldots$, are i.i.d. and Gaussian with mean 0 and variance
1.

(in)  For  Px  -  a.e.  x,  there  exists  a  finite  integer  Ns{x)  >  A'4(x)  such  that 
N  >  Ns{x)  implies,  for  all  i  =  1,2, . . . ,  and  p  in  [0, 1], 

Eu(l«Ar(p,-0  -  £(<‘i)l")  <  107'''‘(JV)log'/'  (3.20) 

where^iN)  =  0((iVc(iV))-i/2). 

Proof. We begin with some notation. For any distribution function (d.f.)
$F(t)$, $-\infty < t < \infty$, let $F^{-1}(u) = \inf\{t : F(t) \geq u\}$, $u \in (0, 1)$, denote
the corresponding inverse function. Let $F_N(p, t)$ denote the d.f. of $\xi_N(p, \omega)$
and $\Phi(t)$ denote the standard normal d.f. The functions $\xi_N(p, u) = \inf\{t :
F_N(p, t) \geq u\}$ and $\varepsilon(u) = \Phi^{-1}(u)$, $u \in (0, 1)$, are r.v.'s on the probability
space $([0, 1], \mathcal{B}[0, 1], \lambda(du))$, where $\mathcal{B}[0, 1]$ is the Borel $\sigma$-field of $[0, 1]$ and
$\lambda(du)$ is the Lebesgue probability measure on $[0, 1]$, with the same
distributions as $\xi_N(p, \omega)$ and $\varepsilon(\omega)$, respectively. For any fixed $p$, the CLT implies
that $\xi_N(p, u) \to \varepsilon(u)$ $\lambda(du)$-a.s. as $N \to \infty$.
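The quantile construction just described can be illustrated numerically. In the sketch below, a standardized Binomial$(n, 1/2)$ variable stands in for $\xi_N(p, \omega)$, which is an assumption made to keep the example small (it corresponds to all Bernoulli parameters being equal to 1/2). Both variables are realized on $(0, 1)$ through their inverse d.f., and the mean square distance between them is approximated by a midpoint rule in $u$. The function name `coupling_error` and the grid size are hypothetical choices.

```python
import math
from statistics import NormalDist


def coupling_error(n, grid=20000):
    """Approximate E|xi(u) - eps(u)|^2 for u uniform on (0, 1), where xi is the
    quantile function of a standardized Binomial(n, 1/2) variable and
    eps(u) is the standard normal quantile function."""
    # Cumulative distribution of Binomial(n, 1/2) at k = 0, ..., n.
    cdf = []
    acc = 0.0
    for k in range(n + 1):
        acc += math.comb(n, k) / 2.0 ** n
        cdf.append(acc)

    mean, sd = n / 2.0, math.sqrt(n) / 2.0
    std_normal = NormalDist()

    k = 0
    total = 0.0
    for j in range(grid):
        u = (j + 0.5) / grid
        # Generalized inverse: smallest k with cdf[k] >= u (u is increasing in j).
        while k < n and cdf[k] < u:
            k += 1
        xi = (k - mean) / sd
        eps = std_normal.inv_cdf(u)
        total += (xi - eps) ** 2
    return total / grid


err_small = coupling_error(25)
err_large = coupling_error(400)
```

Under this coupling the two variables are driven by the same $u$, so their distance reflects only the discrepancy between the two d.f.s. In this example the error shrinks as $n$ grows, which is the qualitative behavior that the Berry-Esseen argument of the proof quantifies.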

On the other hand, the Berry-Esseen Inequality (e.g., Shorack and
Wellner (1986, p. 848)) implies that, for all $p$ in $[0, 1]$,

$$\sup_t |F_N(p, t) - \Phi(t)| \leq \gamma(N), \qquad (3.21)$$

where $\gamma(N) = C_{BE} [r' N c(N)]^{-1/2}$, with $r'$ as in the Appendix, if $N$ is large
enough. Here, $C_{BE}$ denotes the absolute positive constant involved in the
Berry-Esseen Inequality.

Now  let  B{N,p),  p  €  [0,1],  be  the  subset  of  [0,1]  defined  by  B{N,p)  = 
{«  €  (0,1)  :  \(n{p,u)  —  e(u)|  >  \/j{N)].  Owing  to  Shorack  and  Wellner 

(1986),  Ex.  7  p.  65,  we  deduce  from  (3.21)  that  X{B{N,p)}  < 
uniformly  in  p. 

We  are  now  in  a  position  to  derive  (3.20).  First  note  that,  since  E\{e^)  = 
^^a{^a^(p,-)^}  =  we  have 

£^A(MP,-)-eh<2£A{|£|  \e-UP.-)\}-  (3.22) 
But, by the Cauchy-Schwarz Inequality and the definition of $\varepsilon(u)$,

I  \e{u)\  \e{u)  -  ^N{p,u)\du  <2(  [  (3.23) 

Jb(,n,p)  \Jb(n,p)  ) 


whereas 


I  \4u)\\eiu)-(N{p,u)\du<yfi{^.  (3.24) 

Jb{N,p)‘^  '' 


In order to obtain a workable upper bound for the RHS of (3.23), we note that, since $\Phi^{-1}(u)$ is increasing on $(0,1)$ and $\{\Phi^{-1}(u)\}^2$ is symmetric about $1/2$, the integral $\int_{B(N,p)} \{\Phi^{-1}(u)\}^2\,du$ is less than the integral $\int_{C(N,p)} \{\Phi^{-1}(u)\}^2\,du$, where $C(N,p)$ has the same Lebesgue measure as $B(N,p)$ and is the union of $(0,a]$ and $[1-a,1)$ for some $a$, $0 < a < 1/2$. Since $\lambda\{B(N,p)\} \le \sqrt{\gamma(N)}$, it follows that

$$\int_{B(N,p)} \{\Phi^{-1}(u)\}^2\,du \le 2\int_{b(N)}^{\infty} t^2\varphi(t)\,dt, \qquad (3.25)$$

where $b(N) = \Phi^{-1}\left(1 - \tfrac{1}{2}\sqrt{\gamma(N)}\right)$ and $\varphi(t) = (2\pi)^{-1/2}\exp(-t^2/2)$. An integration by parts shows that the RHS of (3.25) is equal to

$$2\left\{(2\pi)^{-1/2}\,b(N)\exp(-b^2(N)/2) + (1-\Phi)(b(N))\right\}.$$

Thus, for $b(N) > 1$,

< 


'  ^  b{N)  '  V  'V'-/ 

2b\N)  ^1  +  (1  -  ^){b{N))  + 


<  AP{N){l-^){b{N))  +  ,fi{N) 

<  {2bHN)  +  l)y^^). 


(3.26) 
Finally, for $b(N) > 1$, $\sqrt{\gamma(N)} = 2(1-\Phi)(b(N)) < 2(2\pi)^{-1/2}\,b^{-1}(N)\exp(-b^2(N)/2)$, implying that

$$\log\frac{1}{\gamma(N)} > b^2(N) + 2\log(b(N)) + \log\left(\frac{\pi}{2}\right) > b^2(N), \qquad (3.27)$$

from  which  it  follows  that 

Putting (3.21)-(3.28) together, we obtain for all $N$ large enough that

$$E_\lambda\{|\xi_N(p,\cdot) - \varepsilon|^2\} \le 10\gamma^{1/2}(N)|\log\gamma(N)| + 2\gamma^{1/2}(N). \qquad (3.29)$$

□

Thus, to conclude the proof of Lemma 3.5, it suffices to take

$$U = (0,1)^{\mathbb{N}}, \qquad P_U(du) = \prod_{i=1}^{\infty} \lambda(du_i),$$

$\xi_N(p,u_i) = \inf\{t : F_N(p,t) \ge u_i\}$ and $\varepsilon(u_i) = \Phi^{-1}(u_i)$.
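As a rough numerical illustration of the quantile construction above, the following sketch couples a standardized Binomial(n, q) variable with a standard Gaussian through a common uniform $u$, and approximates the mean square distance $E_\lambda|\xi(u) - \Phi^{-1}(u)|^2$ by a midpoint rule. The binomial law is only a stand-in for the law of $\xi_N(p,\cdot)$ (an assumption, since $F_N(p,t)$ is not reproduced here), and all function names are hypothetical.

```python
import math
from bisect import bisect_left
from statistics import NormalDist


def standardized_binomial_quantile(n, q):
    """Return the map u -> inf{t : F(t) >= u} for the standardized Binomial(n, q) law."""
    pmf = [math.comb(n, k) * q**k * (1 - q) ** (n - k) for k in range(n + 1)]
    cdf, total = [], 0.0
    for w in pmf:
        total += w
        cdf.append(total)
    mean, sd = n * q, math.sqrt(n * q * (1 - q))

    def quantile(u):
        # Smallest k with F(k) >= u (capped at n to guard against rounding).
        k = min(bisect_left(cdf, u), n)
        return (k - mean) / sd

    return quantile


def coupling_mean_square(n, q=0.3, grid=4000):
    """Midpoint-rule approximation of E_lambda |xi_n(u) - Phi^{-1}(u)|^2 on (0, 1)."""
    xi = standardized_binomial_quantile(n, q)
    eps = NormalDist().inv_cdf  # Gaussian quantile function Phi^{-1}
    us = [(k + 0.5) / grid for k in range(grid)]
    return sum((xi(u) - eps(u)) ** 2 for u in us) / grid


if __name__ == "__main__":
    for n in (25, 100, 400):
        print(n, coupling_mean_square(n))
```

In this stand-in setting the mean square distance shrinks as $n$ grows, which is the qualitative behaviour that an estimate such as (3.29) quantifies.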


Step 4 We now compare the Skorohod version $V^{(m)}(u) = \sqrt{N}\{q^{(m)}(u) - p_N\}$ of (3.9) to the autoregressive linear Gaussian process

$$Z^{(m+1)}(u) = r_N Z^{(m)}(u) + \sigma_N \varepsilon^{(m)}(u), \qquad (3.30)$$

where $\sigma_N = S_N(p_N)$ and $\varepsilon^{(m)}(u) = \varepsilon(u_m)$, and to

$$\tilde{Z}^{(m+1)}(u) = r^* \tilde{Z}^{(m)}(u) + \sigma^* \varepsilon^{(m)}(u), \qquad (3.31)$$

where $r^*$ and $\sigma^*$ have been defined in (2.8) and (2.19), respectively. In order to achieve these comparisons, we suppose that $V^{(m)}(u)$ is in its stationary regime, that $Nc(N) \to \infty$ as $N \to \infty$ and that $N \ge N_3(x)$ as defined in Lemma 3.5. For simplicity, we will suppress the argument $u$ in the notation and write $V^{(m)}$, $Z^{(m)}$ and $\varepsilon^{(m)}$.
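For orientation, a linear Gaussian autoregression of the form $Z^{(m+1)} = rZ^{(m)} + \sigma\varepsilon^{(m)}$ with $|r| < 1$ and i.i.d. standard Gaussian innovations has stationary variance $\sigma^2/(1 - r^2)$, and, when started from 0, variance $\sigma^2(1 - r^{2m})/(1 - r^2)$ after $m$ steps. The sketch below checks both facts numerically; the values of `r` and `sigma` are arbitrary illustrations, not the $r_N$, $\sigma_N$ of the paper, and the function names are hypothetical.

```python
import random


def stationary_variance(r, sigma):
    """Stationary variance of Z' = r Z + sigma * eps, valid for |r| < 1."""
    return sigma**2 / (1.0 - r**2)


def iterate_variance(r, sigma, steps, v0=0.0):
    """Exact variance after `steps` iterations, starting from variance v0."""
    v = v0
    for _ in range(steps):
        v = r * r * v + sigma * sigma
    return v


def simulate_ar1(r, sigma, steps, seed=0):
    """Simulate one path of the autoregression started at Z = 0."""
    rng = random.Random(seed)
    z, path = 0.0, []
    for _ in range(steps):
        z = r * z + sigma * rng.gauss(0.0, 1.0)
        path.append(z)
    return path


if __name__ == "__main__":
    r, sigma = 0.6, 1.0
    tail = simulate_ar1(r, sigma, 200_000, seed=1)[1000:]  # drop a burn-in
    mean = sum(tail) / len(tail)
    empirical = sum((z - mean) ** 2 for z in tail) / len(tail)
    print(stationary_variance(r, sigma), empirical)
```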

From  Lemmas  3.3  and  3.4  it  follows  that 

|V/(-+i)  _  (rKV'l”)  +aNd"’)l  <  ^|V''”’P  +  ^|V‘”"lld”'’l  (3.32) 


which  implies 

j|  y(-+i)  _  ^(m+i)  1I2  <  rs  II  Ft"*)  -  Zt"*)  II2  +a(iV),  (3.33) 


where 


A  AD 

^  s/N{\-R{N)Y  y/N{\-R[N)]  ^ 


with 


If  we  select  Zt“)  =  F^o)^  it  follows  from  (3.33)  that 


Ft"*)  -  Zt"*)  II2  < 


a{N) 

\  —rs' 


with 


zt"*)  =  (r/v)t"*)Ft°)  +  as 


(3.34) 


(3.35) 


(3.36) 


(3.37) 


j=i 


If,  moreover,  we  select  Zl°)  =  Zt°)  =  Ft®),  we  obtain  from  (3.30)-(3.31)  and 
(3.33) 

II  v(m)  _  ^(m)  11^  <  a{N)  -^6{m,N)^ 


1  - 


where 


(r*) 


^(m,  N)  =  |r^  -  r*|  I  ^  +  k;v  - 


R{N)  l-r* 

Finally,  if  we  select  for  each  N  large  enough  an  integer  m{N)  such  that 


(3.39) 


p{N)  = 


1  -  R{N) 


0  as  N  —*  00 


(3.40) 
and denote $\delta(N) = \delta(m(N), N)$, then we obtain a r.v. $V_N = V^{(m(N))}$ with

distribution  An  and  a  Gaussian  r.v.  Zn^ 

m(N) 

Zyv  =  a*  X!  (3.41) 

i=i 


with  mean  0  and  variance  t;*{l  —  such  that 


II  Vn  -  Zn  H:  <  +  p(N). 

1  -Tn 

(3.42) 

Since  m{N) 

can  be  taken  arbitrarily  large,  we  can  assume  that 

II  Vn  -  Zn  h  <  a{N), 

(3.43) 

where 

1  -  rjv 

(3.44) 

and 

(3.45) 

Step 5 Before proceeding to Part II, we deduce from the results obtained above several properties of the asymptotic behavior of $\Lambda_N$ from which the assertions (i)-(iii) of Theorem 1 will be derived in Part II.


1. In view of (3.42)-(3.45) and the convergence in distribution of $Z_N$ to a Gaussian r.v. with mean 0 and variance $v^*$, it results from Lemma 2, p. 254, Feller (1971) that $V_N$ converges in distribution to the same limit.

2. Since $V^{(m)}$ is assumed stationary,

$$\mathrm{Mean}(\Lambda_N) = E_U\{p_N + N^{-1/2}V^{(m)}\} \qquad (3.46)$$

$$= p_N + N^{-1/2}E_U(V_N) \qquad (3.47)$$

by taking $m = m(N)$. Thus

$$|\mathrm{Mean}(\Lambda_N) - p_N| \le N^{-1/2}\{|E_U(Z_N)| + \|V_N - Z_N\|_1\} \le N^{-1/2}\|V_N - Z_N\|_2 \le N^{-1/2}\alpha(N),$$

since $Z_N$ has mean 0 and by making use of the Cauchy-Schwarz Inequality. It can be proved similarly that

IVar  (A„)  -  ^1  =  0(1)  (3.48) 

provided that the rate of convergence of $c(N)$ to 0 is such that $\alpha(N)/\{1 - R(N)\}$ converges to 0.

Similar  bounds  can  be  derived  for  higher  moments. 


3.  Let  Qsit)  =  <  <}  be  the  distribution  function  of  in  its 

stationary  regime  and  ^m,N{t)  =  Aj{pn  +  <  0- 

From  (3.36)  and  Chebyschev  Inequality  it  results  that,  for  all  real  t  and 
positive  h, 


|Qn(0  -  ^m,N{t  +  h)\  < 


]_ 

*2 


a{N) 


1  —  ri^ 


Letting  m  — »  oo,  we  obtain  that,  for  all  real  t, 

I 


|Qyv(0  -  ^iv(OI  ^ 


/l2 


ol{N) 


12 


1  -  Tat 


+ 


sup  ipAf 


(3.49) 


(3.50) 


where  ^^(t)  is  the  normal  distribution  function  with  mean  and 
variance  <T^/{iV(l  —  r^)}  and  <PAr(t)  is  the  corresponding  density.  A 
proper  selection  of  h  =  h/v  in  (3.50)  yields  for  the  sup-norm  H  Qn  — 

Iloo  : 

^  J  (3.51) 

=  0{a^^^{N))  as  N  —*  oo. 
4. Finally, we will use in Part II of the proof of Theorem 1 an upper bound for $\Lambda_N(J_N^c)$, where $J_N = [c(N), 1 - c(N)]$ and $J_N^c$ is the complement of $J_N$ in $[0,1]$. Since the measure $\Lambda_N$ is concentrated on $[0,1]$, the distribution function of $\Lambda_N$ satisfies $Q_N(0) = 0$ and $Q_N(1) = 1$, and we have $\Lambda_N(J_N^c) = Q_N(c(N)) + (1 - Q_N)(1 - c(N))$. Now, using inequality (3.50) with a proper choice of $h = h_N$ yields


Aa^(  =  O 


a\N) 


(3.52) 


PART  II 

In  order  to  derive  (i)-(iii)  of  Theorem  1  from  points  (l)-(3)  of  Step  5  of 
Part  I,  we  will  first  show  that  \n(B)  <  ^n(B)  for  all  Borel  subsets  B  of 
Jyv-  To  this  end,  we  consider  the  Skorohod  representation  of  <31^'"^  and 
assuming  that  €  J/y-  Let  k({£  >1)  denote  the  £  th  exit  time  of 

p(”‘+3)  from  Jtv-  For  0  <  m  <  and  0  Jj^, 

Since  f^iq)  =  j(N)fN  €  Jn  and  Ssiq)  =  0  for  ?  ^  Jiv, 

Also,  since  jg  drawn  from  Pjv  and  P^v  is  assumed  to  be  the  Dirac 

measure  at  j{N)/N^  p(*i+i)  =  ^(*1+2)  =  j(N)/N.  An  induction  shows  that, 
for  £  >  0, 

g(m)  ^  p(m-o  if  kt  +  £<m<k(+r+£ -I,  (3.53) 

with the convention $k_0 = 0$. Hence, if $I_B(q^{(m)}) = 1$, where $I_B(\cdot)$ is the indicator function of $B \subset J_N$, then there exists $m' \le m$ such that $I_B(p^{(m')}) = 1$, implying that

$$\frac{1}{M}\sum_{m=1}^{M} I_B(q^{(m)}) \le \frac{1}{M}\sum_{m=1}^{M} I_B(p^{(m)}) \qquad (3.54)$$

for all $M$. But, by ergodicity, letting $M \to \infty$ in (3.54) yields

$$\Lambda_N(B) \le \Psi_N(B) \quad \text{for all } B \subset J_N. \qquad (3.55)$$

Next, since $\Psi_N$ is concentrated on $J_N$, it follows from (3.55) and $\Lambda_N([0,1]) = \Psi_N([0,1]) = 1$ that the total variation $\|\Psi_N - \Lambda_N\|_{TV}$ satisfies

$$\|\Psi_N - \Lambda_N\|_{TV} \le 2\Lambda_N(J_N^c), \qquad (3.56)$$

which implies (i) and (iii), in view of points (1) and (3) in Part I. Finally,
since  and  are  in  [0, 1],  (3.56)  implies  that 

1  Mean{<ifN)  -  Mean(Ajv)  [  <  /o  t  |  |  {dt) 

=  O(^), 

and,  similarly, 

I  VarCCA,)  -  |=  0(— ^),  (3.58) 

thus (ii) is obtained from point (2), which completes the proof of Theorem 1.

4  Sequential  Versions  of  SEM 

In this section, we turn our attention to two different sequential versions of SEM. Sequential procedures are used when the observations $x_1, \ldots, x_n, \ldots$ are received one at a time and the estimates of the mixture parameters have to be updated before the next observation is received. Chapter 6 in Titterington, Smith and Makov (1985) provides a thorough examination of sequential methods in the mixture setting. Great attention is paid to the particular problem of estimating the mixing proportion $p^*$ for two-component mixtures where the component densities are assumed known. This is the problem that we address in this section. Titterington, Smith and Makov review the main approaches to this problem, namely, the decision-directed method (Davisson and Schwartz, 1970), the Quasi-Bayes method (Makov and Smith, 1977), the probabilistic editor method (e.g., Owen, 1975), the method of moments (e.g., Odell and Basu, 1976), a Newton-Raphson-type gradient algorithm for finding the minimum of the Kullback-Leibler divergence (Kazakos, 1977) and the probabilistic teacher method (Agrawala, 1970, Silverman, 1980). This last approach is nothing but a one-step sequential version of SEM and is the object of Subsection 4.1 below. Moreover, Titterington (1984) introduces a general recursive method which, in the particular mixture problem under consideration, can be written as

$$p^{(m+1)} = p^{(m)} + \{(m+1)J_c(p^{(m)})\}^{-1} S(x_{m+1}, p^{(m)}), \qquad (4.1)$$

where $J_c(p) = [p(1-p)]^{-1}$ is the complete data Fisher information for a single observation from $h(x,p) = pf_1(x) + (1-p)f_2(x)$ and $S(x,p) = [f_1(x) - f_2(x)]/h(x,p)$ is the score $(\partial/\partial p)\log h(x,p)$.

As noted in Titterington et al. (1985), p. 184: “A convenient way to approach the study of the asymptotic properties of the various proposed procedures is through the theory of stochastic approximation which exploits the martingale structure implicit in the recursions involved in these methods” (see (4.1), for instance). Thus, the consistency of the estimators derived from the Quasi-Bayes method, the Kazakos algorithm, the probabilistic teacher method and the Titterington algorithm has been proved using results from martingale theory.
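A recursion of the type (4.1) can be sketched as follows for two unit-variance Gaussian components. The component means, the sample size and the function names are assumptions chosen for illustration only. The sketch uses the identity $p(1-p)S(x,p) = t(x,p) - p$, where $t(x,p) = pf_1(x)/h(x,p)$ is the posterior probability of the first component, so that each update is a convex combination of the current estimate and a posterior probability and therefore stays in $(0,1)$.

```python
import math
import random


def normal_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def sample_mixture(n, p_star, mu1, mu2, seed=0):
    """Draw n observations from p* N(mu1, 1) + (1 - p*) N(mu2, 1)."""
    rng = random.Random(seed)
    return [rng.gauss(mu1 if rng.random() < p_star else mu2, 1.0) for _ in range(n)]


def score(x, p, mu1, mu2):
    """S(x, p) = [f1(x) - f2(x)] / h(x, p)."""
    f1, f2 = normal_pdf(x, mu1), normal_pdf(x, mu2)
    return (f1 - f2) / (p * f1 + (1.0 - p) * f2)


def posterior(x, p, mu1, mu2):
    """t(x, p) = p f1(x) / h(x, p)."""
    f1, f2 = normal_pdf(x, mu1), normal_pdf(x, mu2)
    return p * f1 / (p * f1 + (1.0 - p) * f2)


def recursive_estimate(xs, p0, mu1, mu2):
    """Iterate p <- p + {(m+1) J_c(p)}^{-1} S(x_{m+1}, p), with J_c(p)^{-1} = p(1-p)."""
    p = p0
    for m, x in enumerate(xs):
        p += p * (1.0 - p) * score(x, p, mu1, mu2) / (m + 1)
    return p


if __name__ == "__main__":
    data = sample_mixture(20_000, p_star=0.3, mu1=0.0, mu2=3.0, seed=1)
    print(recursive_estimate(data, p0=0.5, mu1=0.0, mu2=3.0))
```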

A good measure of the efficiency of a sequence of estimators $\{p^{(m)}\}$ is its asymptotic relative efficiency defined as

$$\mathrm{ARE} = \lim_{m \to \infty} \frac{m^{-1}V}{\mathrm{Var}(p^{(m)})}, \qquad (4.2)$$

where $V = J_{obs}^{-1}$ is the Cramér-Rao lower bound. Kazakos (1977) has designed his algorithm to be fully efficient, i.e. to have ARE = 1. But his scheme requires numerical integrations which are computationally unattractive. Now, it is a striking fact that the ARE's of the Quasi-Bayes, the probabilistic teacher and the Titterington algorithms are positive iff the ratio $J_{obs}/J_c > 1/2$.


4.1  The  one-step  sequential  SEM  algorithm 

This is the standard sequential version of SEM. Each time a new observation $x_{m+1}$ is received, only the classification $z_{m+1}$ is drawn at random from the current posterior probability $t(x_{m+1}, p^{(m)})$. There is no feedback as to the correctness of previous decisions: the other $z_j$'s, $1 \le j \le m$, are kept constant. More formally, the one-step sequential SEM works as follows. Denote by $p^{(m)}$ $(m \ge 1)$ the estimate of $p^*$ computed on the basis of the observations $x_1, \ldots, x_m$, and by $p^{(0)}$ the starting point of the algorithm. After $x_{m+1}$ has been received, the E-step consists of computing the posterior probability $t_{m+1} = t(x_{m+1}, p^{(m)})$, according to (2.3). The S-step draws the r.v. $z_{m+1}$ from a Bernoulli distribution with parameter $t_{m+1}$. The M-step updates $p^{(m)}$ as

$$p^{(m+1)} = p^{(m)} + \frac{z_{m+1} - p^{(m)}}{m+1}. \qquad (4.3)$$

As noted above, the recursion (4.3) can be viewed as the probabilistic teacher algorithm of Agrawala (1970). Thus, the results obtained by Silverman (1980) can be transferred. These results are summarized in Theorem 2 below.
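The E-, S- and M-steps just described can be sketched as below for two unit-variance Gaussian components. This is a minimal illustration under stated assumptions, not the authors' implementation: the component means and function names are hypothetical, and the starting point is treated as if it summarized `n0` pseudo-observations, a device introduced here only so that early iterates stay strictly inside $(0,1)$. With `n0 = 0` the update reduces literally to (4.3), in which case the first iterate equals the first drawn label and the recursion can remain stuck at 0 or 1.

```python
import math
import random


def normal_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def sample_mixture(n, p_star, mu1, mu2, seed=0):
    """Draw n observations from p* N(mu1, 1) + (1 - p*) N(mu2, 1)."""
    rng = random.Random(seed)
    return [rng.gauss(mu1 if rng.random() < p_star else mu2, 1.0) for _ in range(n)]


def one_step_sequential_sem(xs, p0, mu1, mu2, n0=10, seed=0):
    """One pass over xs; each observation is classified once and never revisited."""
    rng = random.Random(seed)
    p = p0
    for m, x in enumerate(xs):  # x plays the role of x_{m+1}
        f1, f2 = normal_pdf(x, mu1), normal_pdf(x, mu2)
        t = p * f1 / (p * f1 + (1.0 - p) * f2)  # E-step: posterior probability
        z = 1 if rng.random() < t else 0         # S-step: Bernoulli(t) draw
        p += (z - p) / (m + 1 + n0)              # M-step: (4.3) when n0 = 0
    return p


if __name__ == "__main__":
    data = sample_mixture(20_000, p_star=0.3, mu1=0.0, mu2=3.0, seed=1)
    print(one_step_sequential_sem(data, p0=0.5, mu1=0.0, mu2=3.0, seed=2))
```

The cost per observation is constant, which is the practical appeal of this version; the price is the efficiency loss quantified in Theorem 2.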

Theorem 2 (Silverman 1980) (i) $p^{(m)} \to p^*$ a.s. as $m \to \infty$.

(ii) The ARE of the one-step sequential SEM algorithm is equal to

$$\mathrm{ARE} = \max\left(0,\; 2 - \frac{J_c}{J_{obs}}\right). \qquad (4.4)$$
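To get a feel for (4.4), the sketch below evaluates $J_c = [p(1-p)]^{-1}$ and $J_{obs} = \int [f_1(x) - f_2(x)]^2 / h(x,p)\,dx$ by a trapezoidal rule for two unit-variance Gaussian components, and then the resulting ARE. The Gaussian components, the integration range and the function names are assumptions made for illustration.

```python
import math


def normal_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def complete_information(p):
    """J_c(p) = 1 / (p (1 - p))."""
    return 1.0 / (p * (1.0 - p))


def observed_information(p, mu1, mu2, steps=20_000):
    """Trapezoidal approximation of the integral of (f1 - f2)^2 / h over a wide range."""
    lo, hi = min(mu1, mu2) - 10.0, max(mu1, mu2) + 10.0
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * dx
        f1, f2 = normal_pdf(x, mu1), normal_pdf(x, mu2)
        h = p * f1 + (1.0 - p) * f2
        value = (f1 - f2) ** 2 / h if h > 0.0 else 0.0
        total += value * (0.5 if k in (0, steps) else 1.0)
    return total * dx


def one_step_are(p, mu1, mu2):
    """ARE of the one-step sequential SEM algorithm according to (4.4)."""
    ratio = complete_information(p) / observed_information(p, mu1, mu2)
    return max(0.0, 2.0 - ratio)


if __name__ == "__main__":
    for gap in (0.5, 3.0, 10.0):
        print(gap, one_step_are(0.3, 0.0, gap))
```

Well separated components give $J_{obs}$ close to $J_c$ and an ARE close to 1, whereas strongly overlapping components drive the ratio $J_c/J_{obs}$ above 2 and the ARE to 0.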

4.2  The  global  sequential  SEM  algorithm 

In this subsection, we inherit the notation of Sections 2 and 3. The main object of this subsection is to explain and prove Theorem 3. This theorem concerns convergence properties of the particular sequential version of SEM that we call the global sequential SEM algorithm. This version works as follows. Denote by $p^{(m)}$ $(m \ge 1)$ the estimate of $p^*$ computed on the basis of the observations $x_1, \ldots, x_m$ and by $p^{(0)}$ the starting point of the algorithm. After the $(m+1)$th observation, $x_{m+1}$, has been recorded, the E-step updates the posterior probabilities as $t_i^{(m+1)} = t(x_i, p^{(m)})$, $i = 1, \ldots, m$, and computes the new posterior probability $t_{m+1}^{(m+1)} = t(x_{m+1}, p^{(m)})$, according to (2.3). The S-step draws independently each r.v. $z_i^{(m+1)}$, $i = 1, \ldots, m+1$, from a Bernoulli distribution with parameter $t_i^{(m+1)}$. The M-step updates $p^{(m)}$ as

$$p^{(m+1)} = T_{m+1}(p^{(m)}) + (m+1)^{-1/2} S_{m+1}(p^{(m)})\,\varepsilon^{(m+1)}, \qquad (4.5)$$

where $\varepsilon^{(m+1)} = \xi_{m+1}(p^{(m)}, \omega^{(m+1)})$.

The important difference with the one-step sequential SEM algorithm is that all the observations are again randomly attributed to one of the components of the mixture after each new observation has been recorded. Accordingly, there are more and more computations involved in the $m$th iteration as $m$ increases. On the other hand, one can expect that the convergence to $p^*$ will be much quicker. This is suggested by the assertion (iv) of Theorem 3 below, which states that the ARE of the global sequential version of SEM is positive (at least under the assumption that $c(m)$ and $R(m)$ be constant). On the contrary, the ARE of the one-step sequential version of SEM is zero whenever $J_c/J_{obs} \ge 2$ (see (4.4)).
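The growing cost per iteration is easy to see in a direct sketch of the global version: at step $m+1$ all $m+1$ labels are redrawn. The code below is an illustration under stated assumptions rather than the procedure (4.5) itself. It takes the new estimate to be the proportion of labels attributed to the first component and then clips it into $[c, 1-c]$ with a constant $c$, which is a simplification introduced here of what happens when the iterate leaves the interval; the Gaussian components and the function names are hypothetical.

```python
import math
import random


def normal_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def sample_mixture(n, p_star, mu1, mu2, seed=0):
    """Draw n observations from p* N(mu1, 1) + (1 - p*) N(mu2, 1)."""
    rng = random.Random(seed)
    return [rng.gauss(mu1 if rng.random() < p_star else mu2, 1.0) for _ in range(n)]


def global_sequential_sem(xs, p0, mu1, mu2, c=0.05, seed=0):
    """Return the path of estimates; step m+1 redraws the labels of x_1..x_{m+1}."""
    rng = random.Random(seed)
    d1 = [normal_pdf(x, mu1) for x in xs]  # component densities, computed once
    d2 = [normal_pdf(x, mu2) for x in xs]
    p, path = p0, []
    for m in range(len(xs)):               # observation x_{m+1} has just arrived
        count = 0
        for i in range(m + 1):
            t = p * d1[i] / (p * d1[i] + (1.0 - p) * d2[i])  # E-step
            if rng.random() < t:                              # S-step
                count += 1
        p = count / (m + 1)                # M-step: proportion of first-component labels
        p = min(max(p, c), 1.0 - c)        # assumed truncation to [c, 1 - c]
        path.append(p)
    return path


if __name__ == "__main__":
    data = sample_mixture(1000, p_star=0.3, mu1=0.0, mu2=3.0, seed=1)
    print(global_sequential_sem(data, p0=0.5, mu1=0.0, mu2=3.0, seed=2)[-1])
```

The total work is quadratic in the number of observations, against linear work for the one-step version.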

Note that, since Theorem 3 (iii) establishes the a.s. convergence of $p^{(m)}$ to $p^* \in (0,1)$ as $m \to \infty$, there is a.s. at most a finite number of $p^{(m)}$'s outside the intervals $J_m$. This means that the choice of the procedure when $p^{(m)}$ exits from $J_m$ is not important. This is the reason why we have chosen the procedure implicitly contained in (4.5): it is the most convenient for proving Theorem 3, following our approach.

Theorem 3 (i) asserts that, under assumptions which parallel (H1)-(H4) above, the $P_\Omega$-distribution of $p^{(m)} = p^{(m)}(x,\omega)$ is $P_x$-a.s. asymptotically normal with mean $p_m$ and variance $m^{-1}v^*$, where $p_m = p_m(x)$ is the ML estimate of $p^*$ based on the sample $\{x_1, \ldots, x_m\}$ and $v^* = (\sigma^*)^2\{1 - (r^*)^2\}^{-1}$, with $r^*$ and $\sigma^*$ defined in (2.8) and (2.19), respectively.

Theorem 3 (ii) provides $P_x$-a.s. asymptotic upper bounds for $|E_\Omega(p^{(m)}) - p_m|$ and $|\mathrm{Var}_\Omega(p^{(m)}) - m^{-1}v^*|$. These bounds make (i) more precise.

Before  stating  Theorem  3,  we  need  to  state  the  assumptions  (H5)  and 
(H6),  which  are  as  follows. 

(H5)  For  Px—  a.e.x,  —  i2(m)}“‘‘  <  oo  and  V^([^w])  < 

oo  for  any  0  <  0  <  1,  where  [f]  denotes  the  largest  integer  <  t. 

(H6) There exists an exponent $\mu$, $0 \le \mu < 1$, such that $m^{-\mu} = O(c(m))$. (For $\mu = 0$, this means that $c(m)$ is constant. Note that (H6) implies (H2).)
Finally, recall that $\ell(m) = \{2\ell_2(m)\}^{1/2}$, where $\ell_2(m)$ denotes the iterated logarithm. We can now state Theorem 3.

Theorem 3 (i) Suppose that the assumptions (H1), (H3)-(H4) and (H6) hold. Then, $m^{1/2}(p^{(m)} - p_m)$ converges $P_x$-a.s. in $P_\Omega$-distribution to a Gaussian r.v. with mean 0 and variance $v^*$.

(ii) Under the assumptions (H1), (H3)-(H4) and (H6), for $P_x$-a.e. $x$ and all $m$ large enough,

$$|E_\Omega(p^{(m)}) - p_m| \le m^{-1/2}\alpha_{seq}(m) = o(m^{-1/2}) \quad (m \to \infty) \qquad (4.6)$$

and

I  ^«'’n^(P^"‘^)  -  <  2m~^l'^a„eq{‘rn)  +  (4.7) 

where,  for  all  e,Q<9<i, 

aie9("i)  =  0(/3([^m]))  +  0(m~'/*{l  - /2(m)}"^)  (m  — >  oo).  (4.8) 
(iii) Under the assumptions (H1), (H3) and (H5)-(H6),

$$\lim_{m \to \infty}(p^{(m)} - p_m) = 0 \quad P_x \otimes P_\Omega\text{-a.s.} \qquad (4.9)$$

Furthermore, under (H1), (H3) and (H6) with $0 \le \mu < 1/4$,

$$\limsup_{m \to \infty}\, m^{\nu}|p^{(m)} - p_m| = 0 \quad P_x \otimes P_\Omega\text{-a.s.} \qquad (4.10)$$

for $0 < \nu < \min\{(1-\mu)/8,\ (1-4\mu)/2,\ (9-33\mu)/32\}$.

(iv) Under the only assumption that $c(m)$ and $R(m)$ are constant with $0 < c < 1/2$ and $0 < R < 1$, the ARE of the global sequential SEM algorithm is positive.

Proof of Theorem 3. The proof of Theorem 3 parallels that of Theorem 1, except for the proof of (iv). However, since it is delicate, we detail each of its steps. Step 1 contains the proof of some technical results about the rates of $c(m)$, $R(m)$ and related quantities $(m \to \infty)$. Step 2 establishes a result analogous to (3.14). Step 3 constructs the Skorohod representation we need and states the estimate (4.22), analogous to (3.20). Step 4 obtains an estimate of the rate of convergence to 0 of $E_U(|V^{(m)} - Z^{(m)}|^2)$ as $m \to \infty$, where $E_U$ denotes the expectation with respect to the Skorohod representation probability space, $V^{(m)} = V^{(m)}(u) = m^{1/2}\{p^{(m)}(x,u) - p_m(x)\}$ and $\{Z^{(m)} = Z^{(m)}(u)\}$ represents a suitable Gaussian process, such that the r.v. $Z^{(m)}$ has mean 0 and variance $v^{(m)} \to v^*$ as $m \to \infty$. Step 5 deduces the assertions (i)-(iii) of Theorem 3 from the preceding steps. Finally, Step 6 establishes the assertion (iv).

We now proceed to explain the successive steps of the proof of Theorem 3.

Step  1  We  begin  with  some  technical  results  concerning  the  assumptions 
(H5)  and  (H6)  in  Theorem  3.  These  results  are  stated  in  Lemma  4.1,  which 
is  as  follows. 

Lemma 4.1 Suppose that the assumption (H6) in Theorem 3 holds. Then,
(i) For any $\theta$, $0 < \theta < 1$,

OO 

5^  m~^0^{[9m])  <  oo,  (4.11) 

m=l 

where $\beta(m)$ is defined as in (3.35)-(3.36) and $[t]$ denotes the largest integer $\le t$.

(ii) A sufficient, but not necessary, condition for (H5) to hold is $\mu < 1/4$.

(iii) We have, for any $\theta$, $0 < \theta < 1$,


and 


n  ^  as  m  oo 

k=l 

m 

R{k)  — »  0  as  m  — »  oo. 

k=[m6] 


The  proof  of  Lemma  4.1  can  be  found  in  the  Appendix. 


(4.12) 


(4.13) 


Step 2 In Step 2, we establish the following Lemma 4.2, which concerns an estimate of

$$\|V^{(m)}\|_{\Omega,4} = \{E_\Omega(|V^{(m)}|^4)\}^{1/4} \quad (m \to \infty), \qquad (4.14)$$

where $V^{(m)} = m^{1/2}\{p^{(m)}(x,\omega) - p_m(x)\}$ and the notation $E_\Omega$ refers to expectation with respect to the probability measure $P_\Omega$. Notice that the result that we obtain in Lemma 4.2, namely (4.15), is true for $P_x$-a.e. sample sequence $x$ and all $m$ large enough, but cannot be integrated with respect to $P_x$.

Lemma  4.2  Suppose  that  the  assumptions  (HI),  (H3)  and  (H6)  hold.  As¬ 
sume  in  addition  that  m{l  —  R{m)}^  —*  oo  as  m  oo.  Then,  there  exists 
for  Px  -  a. e.  x  a  finite  integer  mo  =  mo(x)  such  that  m  >  mo  implies 


Proof  of  Lemma  4.2  We  have  for  all  m  >  1 

<  (m  -h  -H  l)||r<’")||n,.  ,,  . 

-|-(m -h  l)^/^i2(m -t- l)A,„+i -f  ti;(m -f  1),  ‘ 

where  A„+i  =  Ip^+i  -  p,„l  and  u;(m)  =  (l/4)|l^<'"^|ln,4  1/4  Px  -  a.s.  as 

m  — ♦  oo  (see  (2.8),  (2.19)  and  (3.15)).  This  in  turn  leads  to 

$$\|V^{(m)}\|_{\Omega,4} \le I_m + II_m + III_m \quad \text{for all } m \ge 1, \qquad (4.17)$$

where $I_m = m^{1/2}\Pi_m(2)\|V^{(1)}\|_{\Omega,4} = o(1)$ in view of (H6) and Lemma 4.1,

$$II_m = \sum_{j=2}^{m} m^{1/2}\,\Pi_m(j)\,\Delta_j \quad \text{and} \quad III_m = \sum_{j=2}^{m} (m/j)^{1/2}\,\Pi_m(j+1)\,w(j), \qquad (4.18)$$

and $\Pi_m(j) = R(j) \cdots R(m)$ for $j = 1, \ldots, m$, with the convention that $\Pi_m(m+1) = 1$.

Since  R{j)  <  R{j  +  1)  for  j  >  1,  n„,(i)  <  n,„(j  +  1)  for  j  =  1, . . . ,  m  and, 
by  Lemma  A.4  in  the  Appendix,  there  exists  a.s.  a  finite  integer  mi  =  mi(x) 
such  that  m  >  mi  implies  Am  <  {2Jcl Jobs)^~^ ■,  splitting  Ilm  into  Am  + 
Brm  with  Am  —  and  Bm  —  Tfl  ^  S[g,„]+l<j<mnm(j)^j) 

where  0  <  ^  <  1  and  [t]  denotes  the  largest  integer  <  f,  yields  for  m  >  m2: 

Ilm  <  m^/^5^n„([0m])  +  m^/^(2Jc/46,)[^m]"^ 
i=2 

(H-il(m)  +  i22(m)  +  ...)  (4J9) 

<  m3/2n„,([0m])  +  m-^IH-\2JclJoh.){^  -  i?(n?)}-‘ 

=  o(l) 


in  view  of  (H6)  and  Lemma  4.1.  Splitting  Him  similarly  provides 

(ffm) 

Him  <  {ml2fl'^Bm{\6Tn]  +  l)[6m][9rn]-^^w{j) 

j=2 

+(m/[^m])^/^(max[e„]<j<,„  +  •  •  •) 

(4.20) 

[eml 

<  +  l)[^m]"^^u;(;)  +  -  f?(m)}”^ 

J=2 

(max[e,n]<j<m  w{j)). 

Since  the  Cesaro  mean  and  max[5„,)<j<,„  u;(j)  — >  1/4  as  m  00  and 
m^/^nm([^m]  +  1)  =  0(1),  (4.16)-(4.20)  together  imply 

l|y'”’||n,.  <  0(1)  +  (l/3)r''={l  -  (4.21) 
whenever $m \ge m_2 = m_2(x) \ge m_1$. Choosing $\theta$ such that $(1/3)\theta^{-1/2} < 1$ leads to (4.15) for $m \ge m_0(x) \ge m_2$, as required. □

Step 3 In Step 3, we examine the construction of the Skorohod representation of the r.v.'s $\xi_m(p,\omega)$ and $\varepsilon^{(m)}(\omega)$ for $p \in [0,1]$ and $m \ge 1$, as well as the estimate (4.22), uniform in $p \in [0,1]$, for the mean square distance between these respective representation r.v.'s, as $m \to \infty$. Again, (4.22) is true for $P_x$-a.e. sample sequence $x$ and for $m$ large enough. It cannot be integrated with respect to $P_x$.

Lemma 4.3 Under the assumptions (H1) and (H3) of Theorem 3, there exists a probability space $(U = \{u = (u_1, u_2, \ldots)\}, P_U)$ and r.v.'s $\xi_m(p,u_m)$ and $\varepsilon(u_m)$, $m = 1, 2, \ldots$, such that

(i) $\xi_m(p,u_m)$ and $\varepsilon(u_m)$ have the same distributions as $\xi_m(p,\omega)$ and $\varepsilon^{(m)}(\omega)$, respectively, for each $m = 1, 2, \ldots$ and all $p$ in $[0,1]$.

(ii) For each fixed $p$, the r.v.'s $\xi_m(p,u_m)$, $m = 1, 2, \ldots$, are mutually independent and the r.v.'s $\varepsilon(u_m)$, $m = 1, 2, \ldots$, are i.i.d. and Gaussian with mean 0 and variance 1.

(iii) For $P_x$-a.e. $x$, there exists a finite integer $m_3 = m_3(x) \ge m_0(x)$ such that, for all $m \ge m_3$ and $p$ in $[0,1]$,

$$E_U(|\xi_m(p,u_m) - \varepsilon(u_m)|^2) \le 10\gamma^{1/2}(m)|\log\{\gamma(m)\}| + 2\gamma^{1/2}(m), \qquad (4.22)$$

where $\gamma(m) = C_{BE}[r' m c(m)]^{-1/2}$ is as in Section 3 (see (3.20)).

The proof of Lemma 4.3 completely parallels that of Lemma 3.5, and is thus omitted.

Step 4 In Step 4, we show that there exists a Gaussian process $\{Z^{(m)} = Z^{(m)}(u)\}$, defined on the Skorohod representation probability space $(U, P_U)$ introduced in Step 3, such that the versions $V^{(m)} = V^{(m)}(x,u)$ of $m^{1/2}\{p^{(m)}(x,\omega) - p_m(x)\}$ defined by making use of the r.v.'s $\xi_m(p,u_m)$ introduced in Lemma 4.3 are close to $Z^{(m)} = Z^{(m)}(u)$ in the mean square sense, for $P_x$-a.e. $x$.

More precisely, let $\{Z^{(m)}(u)\}$ be the linear autoregressive Gaussian process, defined on $(U, P_U)$ by the recursion formula

$$Z^{(m+1)}(u) = \{(m+1)/m\}^{1/2}\, r_{m+1} Z^{(m)}(u) + \sigma_{m+1}\varepsilon^{(m+1)}(u) \quad \text{for } m \ge 1, \qquad (4.23)$$

with $Z^{(1)}(u) = \sigma_1\varepsilon^{(1)}(u)$. Then,

$$Z^{(m)}(u) = \sum_{j=1}^{m} (m/j)^{1/2}\,\pi_m(j+1)\,\sigma_j\,\varepsilon^{(j)}(u) \quad \text{for } m \ge 1, \qquad (4.24)$$

where $\pi_m(j) = r_j \cdots r_m$ for $j = 1, \ldots, m$ and $\pi_m(m+1) = 1$, is a Gaussian r.v. with mean 0 and variance

$$v^{(m)} = \sum_{j=1}^{m} (m/j)\,\pi_m^2(j+1)\,\sigma_j^2. \qquad (4.25)$$

Moreover, define $V^{(m)}(x,u) = m^{1/2}\{p^{(m)}(x,u) - p_m(x)\}$, where $p^{(0)}(x,u) = p^{(0)}$ and, for $m \ge 0$,

$$p^{(m+1)}(x,u) = T_{m+1}\{p^{(m)}(x,u)\} + (m+1)^{-1/2}\,S_{m+1}\{p^{(m)}(x,u)\}\,\varepsilon^{(m+1)}(x,u), \qquad (4.26)$$

where $\varepsilon^{(m+1)}(x,u) = \xi_{m+1}\{p^{(m)}(x,u), u_{m+1}\}$.

In the sequel, we will suppress the notation indicating the dependence on $u$ or $(x,u)$, unless necessary. We will let $\|V\|_{U,\alpha} = \{E_U(|V|^{\alpha})\}^{1/\alpha}$ for all finite $\alpha \ge 1$.

Lemma 4.4 below provides an estimate for $E_U(|V^{(m)} - Z^{(m)}|^2)$ as $m \to \infty$. This estimate is uniform in $p \in [0,1]$ and is true for $P_x$-a.e. $x$. Unfortunately, the integration of inequality (4.27) with respect to $P_x$ does not lead to a similar estimate.

Lemma 4.5 below shows that the variance $v^{(m)}$ of $Z^{(m)}$ converges to $v^*$ as $m \to \infty$ and provides an estimate for $|v^{(m)} - v^*|$. Lemma 4.4 is as follows.

Lemma  4.4  Under  the  assumptions  (Hi),  (H3)  and  (H6)  and  the  additional 
assumption  m{\  —  72(771)}“*  —*  oo  as  m  ^  oo,  the  uniform  (in  p)  Skorohod 
representations  and  Z^"*^  satisfy  Px  -  a  s. 

l|K(m)_z(-)|Iu  2  <a,e,(m)  (4.27) 

for  all  m  large  enough,  where,  for  all  6,Q  <  9  <  I, 

o-aegim)  =  O(/9([0m]))  +  0(m~*''^{l  -  /2(m)}"^)  (m  — >  oo).  (4.28) 




Proof  of  Lemma  4.4.  It  follows  from  Lemmas  3.5  and  3.4  that,  for  all 
m  >  1  and  Px  -  a.e.  x, 

||K(".+.)  -  ,  <  (m  + 

||y<">-Z(”l||u,2+s(m  +  l), 

where  is  a  sequence  of  positive  numbers  decreasing  to  zero  such  that, 

for  all  m  >  2  and  Px  -  a.e.  x. 

g{m)  <  p{m)  + 

4-2>lo(m  +  A^l|y(”*-')llu.4 

+^oA„  +  ylo(m  +  ^ 


<  ^(m)  +  2i4om  — /?(m)}  ^  +  o(m  '^^{1  — /?(m)}~^), 

(4.30) 

in  view  of  Lemma  A.4  and  4.2. 

Now,  (4.29)  entails  that,  for  all  m  >  2, 

liyl")  -  Zl”-)||u,2  <  n.''V„(2)ir,+  +  l)s(i),  (4.31) 

>=2 

where  Ky^  =  HyO)  -  Z^^\\v,2  <  oo  Px  -  a.s. 

Since,  for  all  r  such  that  r*  <  r  <  1,  there  exists  Px  -  a.s.  a  finite  integer 
m4  =  m4(x)  >  m3  such  that  0  <  <  r  for  all  m  >  m4,  the  same  splitting 

technique  as  in  the  proof  of  Lemma  4.2  yields,  for  all  [0m]  >  m4, 

[em] 

||y(*n)  _  Z<^)|lu,2  <  m^/2MAXr”*-’"^/i'i  +  5;(m/2)V2 
MAXr— K''"ly(i)  +  {m/[em]y/^g{[em]){l  +  r  +  +  . . .), 

where  MAX=niax(l, r2  . . . ).  Thus,  Px  -  a.s., 

[em] 

||y(m)  _  ^(m)||y  2  ^  0(mi/V”*)  +  O(m3/2r(i-«)"*[0m]-i5;y(j)) 

+O(5([0m]))  (m  -V  00) 


=  0(1)  +  O(</([0m]))  (m  00), 


(4.33) 
which gives (4.27)-(4.28), in view of (4.30), as required. □

We now turn to Lemma 4.5.

Lemma 4.5 For $P_x$-a.e. $x$, we have

$$|v^{(m)} - v^*| = O(m^{-1/2}\ell(m)) \quad (m \to \infty), \qquad (4.34)$$

where $\ell(m) = \{2\ell_2(m)\}^{1/2}$ and $\ell_2(m)$ is the iterated logarithm.

Proof  of  Lemma  4.5.  We  again  use  the  same  splitting  technique  as  above, 
but  with  /(m)  =  m  —  s(m),s(m)  — »  00  and  s(m)  =  o(m)  asm  00,  instead 
of  /(m)  =  [6m\.  We  have 

|y(m)  _  u-|  <  m  {rj(^) .  ..rmfa]  -f  |o-^  -  (<t*)2| 

+km  -  («r*)^|m(m  -  1)'V^  +  . . .  +  -  (a*)2|  (4-35) 

m/(m)-i(r/(m)+i .  •  •  +  (o-*)^|l  +  (m  -  1)'V^  +  . . . 

+  s(m){m  -  s(m)}-^(r/(,n)+i  •  •  -rm)^  -  {1  -  (r*)*}”^  1- 

But,  for  j  >  m4(x),rj  <  r  <  1,  whereas,  by  Lemma  A.l,  |cr|  —  (cr*)^l  = 
0U-''^(U))U  — ►  00)  Px  -  a.s.  Thus,  for  m  large  enough, 

+((T*)2[(m  -  1)-^ |r^  -  {r*y\  +  ...  +  s(m){m  -  s(m)}-* 
l(r/(,„)+i...r,„)^-(r*)2*(-)l] 

+(cr*)^|l  +  (m  —  l)"^(r*)^  +  . . .  +  5(m){m  -  s(m)}“^ 

(r*)2-("‘)  -  {1  -  (r*)2}-i. 

(4.36) 

Now, the first term in the RHS of (4.36) is $O(m r^{2s(m)})$ $(m \to \infty)$; the second term is $O(m^{-1/2}\ell(m))$ $(m \to \infty)$ because of properties of $s(m)$ as $m \to \infty$; the third term is $O(m^{-1/2}\ell(m))$ $(m \to \infty)$: in view of Lemma A.1, $|r_j^2 - (r^*)^2| = O(j^{-1/2}\ell(j))$ $(j \to \infty)$ $P_x$-a.s., thus $|(r_{m-j+1} \cdots r_m)^2 - (r^*)^{2j}| \le j r^{2j}\,O(m^{-1/2}\ell(m))$ $(m \to \infty)$ if $1 \le j \le s(m)$, with $j r^{2j} = j\exp\{-2|\log r|\,j\} \le \sup_{x>0} x\exp\{-2|\log r|\,x\} = (2e|\log r|)^{-1}$ and $(m-1)^{-1} + \ldots + s(m)\{m - s(m)\}^{-1} \le s^2(m)\{m - s(m)\}^{-1} \sim m^{-1}s^2(m) = o(1)$ $(m \to \infty)$; the fourth term is bounded by $o(1) + O((r^*)^{2s(m)})$; finally, choosing $A\log m \le s(m) = o(m)$ $(m \to \infty)$ with $A > (2|\log r|)^{-1}$ above completes the proof. □
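The convergence of the variance of the Gaussian process can be illustrated in the simplified case where the coefficients do not depend on $j$, say $r_j = r$ and $\sigma_j = \sigma$ (an assumption made here only for illustration; in the proof they converge to $r^*$ and $\sigma^*$). The variance then obeys $v^{(1)} = \sigma^2$ and $v^{(m+1)} = \{(m+1)/m\}\,r^2 v^{(m)} + \sigma^2$, equals $\sum_{j=1}^{m} (m/j)\,r^{2(m-j)}\sigma^2$, and tends to $\sigma^2/(1 - r^2)$. The function names below are hypothetical.

```python
def variance_sequence(r, sigma, m_max):
    """Return [v^(1), ..., v^(m_max)] from v^(m+1) = ((m+1)/m) r^2 v^(m) + sigma^2."""
    v = sigma**2
    values = [v]
    for m in range(1, m_max):
        v = (m + 1) / m * r * r * v + sigma * sigma
        values.append(v)
    return values


def variance_closed_form(r, sigma, m):
    """Sum over j of (m/j) r^(2(m-j)) sigma^2, the constant-coefficient weighted sum."""
    return sum((m / j) * r ** (2 * (m - j)) * sigma**2 for j in range(1, m + 1))


def limiting_variance(r, sigma):
    """sigma^2 / (1 - r^2), the constant-coefficient analogue of v*."""
    return sigma**2 / (1.0 - r**2)


if __name__ == "__main__":
    r, sigma = 0.6, 1.0
    seq = variance_sequence(r, sigma, 5000)
    for m in (1, 10, 100, 1000, 5000):
        print(m, seq[m - 1], limiting_variance(r, sigma))
```

In this simplified setting the gap to the limit is of order $1/m$; the slower rate in (4.34) reflects the additional fluctuations of $r_j$ and $\sigma_j$ around their limits.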

Step 5 We are now in a position to deduce the assertions (i)-(iii) of Theorem 3 from Lemmas 4.1-4.5.

Proof of (i): Since, by Lemma 4.5, $Z^{(m)}$ converges in $P_U$-distribution as $m \to \infty$ to a Gaussian r.v. with mean 0 and variance $v^*$, Lemma 4.4 implies that $V^{(m)}$ converges in $P_U$-distribution as $m \to \infty$ to a Gaussian r.v. with mean 0 and variance $v^*$, for $P_x$-a.e. $x$. Thus, the same is true for $V^{(m)}(x,\omega)$ in $P_\Omega$-distribution. Hence (i).


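Proof of (ii): From Lemma 4.4 and the Cauchy-Schwarz inequality, for $P_x$-a.e. $x$ and all $m$ large enough,

$$|E_U(p^{(m)} - p_m - m^{-1/2}Z^{(m)})| = |E_U(p^{(m)}) - p_m| = m^{-1/2}|E_U(V^{(m)} - Z^{(m)})| \le m^{-1/2}\|V^{(m)} - Z^{(m)}\|_{U,2}. \qquad (4.37)$$

Hence,

$$|E_\Omega(p^{(m)}) - p_m| \le m^{-1/2}\alpha_{seq}(m) = o(m^{-1/2}) \quad (m \to \infty). \qquad (4.38)$$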

We  now  turn  to  the  variance  of  p^"*!  with  respect  to  Pn,  Var^{p^'^^). 

We  have  for  Px  -  a.e.  x  and  all  m  large  enough, 

-  Pmllu,!  -  -  Zl"l|lu,2.  (4.39) 


Thus, 

|llp'"’  -  P™llfl,2  -  (4.40) 

Finally,  from  Lemmas  4.4  and  4.5  and  (4.37)-(4.40)  it  follows  that 


|Var;,'’(pl”‘l)  -  llpl"!  -  p„l|n,,|  <  Il£;n(p'"')  -  p™|b.2. 


(4.41) 


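Thus,

$$|\mathrm{Var}_\Omega^{1/2}(p^{(m)}) - m^{-1/2}(v^{(m)})^{1/2}| \le 2m^{-1/2}\alpha_{seq}(m) \qquad (4.42)$$

and

$$|\mathrm{Var}_\Omega^{1/2}(p^{(m)}) - m^{-1/2}(v^*)^{1/2}| \le 2m^{-1/2}\alpha_{seq}(m) + O(m^{-1}\ell(m)). \qquad (4.43)$$

This completes the proof of (ii).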


Proof of (iii): Let $P_m(t)$ and $\Phi_m(t)$, $-\infty < t < \infty$, denote the d.f. of the r.v.'s $p^{(m)}(u) - p_m$ and $m^{-1/2}Z^{(m)}(u)$, respectively, i.e., for $P_x$-a.e. $x$ and $m \ge 1$,

$$P_m(t) = P_U\{p^{(m)} - p_m \le t\}, \qquad \Phi_m(t) = P_U\{m^{-1/2}Z^{(m)} \le t\}. \qquad (4.44)$$

Lemma  4.4  implies  that,  for  Px  -  a.e.  x,  all  h  >  0  and  all  m  large  enough, 
sup  \Pm{t)  -  ^m(01  <  SUp  (l>m{s), 

-oo<t<oo  t-fcm->/2<5<t+/,m->/2 

(4.45) 

where  (l>m{s)  =  {d/ds)^m{s)  is  the  normal  density  function  with  mean  0  and 
variance  Let  t  =  T(m)  >  0  be  given.  Then,  letting  h  =  in 

(4.45),  we  obtain  that  for  Px  -  a.e.  x,  all  m  large  enough  and  all  t  >  t, 

\Pm{t)  -  <  T-^m-'^al  (m) 


+  *  exp{— (l/2)m(u^”*))~^(t  —  t)^} 


and,  for  all  t  <  — t. 


|Pm(t)  -  ^m(OI  <  T  +  Tm*/^(27ri;<'">) 

exp{— (l/2)m(u^"*^)“*(<  +  r)*}. 


(4.46) 


(4.47) 


Since  Puilp^"*’  -  Pm|  >  t}  =  (1  -  Pm)(0  +  Pm{-t)  for  all  t  >  0,  picking 
a  sequence  {T(m)}  of  positive  numbers  such  that 


E  T  ^(m)m  ^a*,,(m)  <  oo 


(4.48) 


and  a  sequence  {t(m)}  such  that  t{m)  >  T(m)  for  all  m, 


£(1  -^m){t{m)]  <  oo 


(4.49) 
and


OO 

^  T(m)m^/^exp[-(l/2)m(n^”‘^)“^{<(m)  -  T(m)}^]  <  oo  (4.50) 

m=l 

yields, in view of (4.46)-(4.47),

$$P_U\{\limsup_{m \to \infty}\, t^{-1}(m)|p^{(m)} - p_m| \le 1\} = 1 \quad P_x\text{-a.s.} \qquad (4.51)$$

It is possible to choose

$$t(m) = \mathrm{cst.}\, m^{-\nu} \qquad (4.52)$$

for any positive constant cst. and

$$0 < \nu < \min\{(1-\mu)/8,\ (1-4\mu)/2,\ (9-33\mu)/32\}. \qquad (4.53)$$

For  (H6)  with  /x  =  0  (i.e.,  c(m)  =  constant),  it  is  possible  to  choose 

t{m)  =  (4.54) 

for  any  positive  constant  cst.  This  concludes  the  proof  of  (iii). 


Step 6 In this last step, we consider the ARE of the global sequential SEM algorithm and prove assertion (iv) of Theorem 3. This is the subject of the following lemma.


Lemma 4.6 Under the assumption that $c(m) = c$ = constant and $R(m) = R$ = constant, with $0 < c < 1/2$ and $0 < R < 1$, the ARE of the global sequential SEM algorithm is positive.

Proof of Lemma 4.6 It suffices to show that

$$\|p^{(m)} - p_m\|_2^2 = O(m^{-1}) \quad (m \to \infty), \qquad (4.55)$$

where the expectation symbol $E$ stands for $E_{x \times \Omega}$. To this end, we introduce the sigma-fields $\mathcal{T}_m$ generated by $x_1, \ldots, x_m$ and $\omega^{(1)}, \ldots, \omega^{(m)}$ $(m \ge 1)$.

a.s.


We  have 

+{m  +  l)-'£{S»„(p(’"l)|7-„) 

4-(l/2)(m  +  1)~*  a.s. 

<  -Pmp  +  E{\pyn+i  -Pm?\Tm) 
+2£(|p(”‘>  -  Pm||Pm+l  -  PmW^m) 
+(l/2)(m  +  1)“^  a.s. 

<  R^\p^'^^-pm?  +  E{\pm^^-vM^m) 

^.2|p(”»)  _  p^l£i/2(|p^^i  _  p^\^\Tm) 

+(l/2)(m  +  1)"’  a.s. 

(4.56) 

If we take the expectation of both members of (4.56), we obtain, by making use of the Cauchy-Schwarz inequality,

$$\|p^{(m+1)} - p_{m+1}\|_2^2 \le R^2\|p^{(m)} - p_m\|_2^2 + \|p_{m+1} - p_m\|_2^2 + 2\|p^{(m)} - p_m\|_2\,\|p_{m+1} - p_m\|_2 + (1/2)(m+1)^{-1}. \qquad (4.57)$$

Now,  since  p^  is  the  ML  estimate  of  p*  based  on  {xi, . . .  ,Xm},  we  have 
llpm  -  P*ll2  =  E{\pyn  -  P*|)^  ~  (mJo6.)“^  as  m  oo.  Thus,  for  any  vo  >  0, 
there  exists  a  finite  integer  M{vq)  such  that  m  >  M{vo)  implies 

||p<"*+'>  -Pm+zlli  <  -P..lli  +  4(1  +  ./o)^/^(mJ„„)-i/2||p(-)  -P,„|l2 

4-2(1  +  v>o){mJobs)  ^  +  (l/2)m  ^ 

(4.58) 

If  we  define  A{i/o)  =  A  =  2{1  +  B{i/o)  =  B  =  2{l  +  i/q)J~^1  4- 

(1/2)  and  =  Hp^”*^  —  PmlU,  then  (4.58)  becomes 

Pm+i  <  B^Vm  +  2Am~'l'^ym  +  Bm~^  (4.59) 

for  all  m  >  M{vq).  We  now  prove  by  induction  that  there  exists  an  integer 
Ml  >  M{uo)  and  a  positive  constant  K  such  that  m  >  Mi  implies  tfm  < 
To  this  end,  assume  that  t/m  ^  for  some  positive  K  and  m 

large  enough.  Then,  in  view  of  (4.59), 

Pm+i  <  A'"  +  2A/\  4  B)m-‘ .  (4.60) 
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The  RHS  of  (4.60)  is  lesser  than  K^(m  +  1)"'  if  m  is  large  enough  and 
(1  -  R^)K^  -  2AK  -B>0.  (4.61) 

But  (4.61)  holds  whenever  K  >  {I  —  R^)~^[A  +  {A^  +  B{1  —  □ 
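The induction that closes the proof can be checked numerically. The following Python sketch uses illustrative values of R, A and B (hypothetical, not taken from the paper). It computes the threshold on K stated after (4.61), finds the first m for which the RHS of (4.60) is at most K^2 (m+1)^{-1}, and then iterates (4.59) with equality, which is the worst case, to confirm that m^{1/2} y_m never exceeds K.

```python
import math

def k_threshold(R, A, B):
    """Lower bound on K above which (1 - R^2) K^2 - 2 A K - B > 0, i.e. (4.61)."""
    return (A + math.sqrt(A * A + B * (1.0 - R * R))) / (1.0 - R * R)

def first_valid_m(R, A, B, K):
    """Smallest m with (R^2 K^2 + 2 A K + B) / m <= K^2 / (m + 1)."""
    c = R * R * K * K + 2.0 * A * K + B
    gap = K * K - c  # positive exactly when (4.61) holds
    if gap <= 0.0:
        raise ValueError("K does not satisfy (4.61)")
    return max(1, math.ceil(c / gap))

def worst_scaled_y(R, A, B, K, steps=10000):
    """Iterate (4.59) with equality from y_m = K m^{-1/2}; return max of sqrt(m) * y_m."""
    m0 = first_valid_m(R, A, B, K)
    y = K / math.sqrt(m0)
    worst = K  # the starting point sits exactly on the bound
    for m in range(m0, m0 + steps):
        y = math.sqrt(R * R * y * y + 2.0 * A * y / math.sqrt(m) + B / m)
        worst = max(worst, y * math.sqrt(m + 1))
    return worst

# Illustrative (hypothetical) constants, not values from the paper.
R, A, B, K = 0.5, 1.0, 1.0, 4.0
print(k_threshold(R, A, B), first_valid_m(R, A, B, K), worst_scaled_y(R, A, B, K))
```

Any K strictly above `k_threshold(R, A, B)` works; a larger K only lowers the index from which the induction step applies.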


This concludes the proof of Theorem 3. □

Remark 4.1 Theorem 3 (ii) implies under the additional assumption (H5) that for $P_X$-a.e. $x$


m  -  Pj\ 

j=i 


and,  similarly, 


—  (m  -+  oo), 


(4.62) 


m 

^  1  {m co).  (4.G3) 

j-i 


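Remark 4.2 The assertion (4.55) in the proof of Lemma 4.6 entails that $p^{(m)}$ is asymptotically unbiased, since

\[ |E_{X\times\Omega}(p^{(m)}) - E_{X\times\Omega}(p_m)| \le E_{X\times\Omega}^{1/2}(|p^{(m)} - p_m|^2) = O(m^{-1/2}) \quad (m \to \infty) \]

and $E_{X\times\Omega}(p_m) = E_X(p_m) = p^*$.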


Remark 4.3 The estimates in Theorem 3 are nonoptimal, since they essentially rely on an $L^2(P_\Omega)$ estimate, namely (4.27). It can be conjectured that

\[ \limsup_{m\to\infty}\, m\,\mathrm{Var}_{X\times\Omega}(p^{(m)}) \ge v^* + J_{obs}^{-1}. \tag{4.65} \]

We now try to support this conjecture. First of all, we know from (4.38) and (4.43) that $E_\Omega(|p^{(m)} - p_m|^2) = v^*m^{-1} + o(m^{-1})$ $(m \to \infty)$ $P_X$-a.s. Thus, $\limsup_m mE_{X\times\Omega}(|p^{(m)} - p_m|^2) \ge v^*$. Next, the ML estimation theory implies that $\limsup_m mE_X(|p_m - p^*|^2) = J_{obs}^{-1}$. Finally, the sample fluctuations of $m^{1/2}(p_m - p^*)$ and $m^{1/2}\{E_\Omega(p^{(m)}) - p_m\}$ can be guessed to be asymptotically uncorrelated: indeed,

\Ex.n{m-^^HPm  -  <  £^x(m’/^K-p‘|  ||D^”>lk2) 

<  E][^{m\pm  -  p'P)  ||^‘’”dlxxn,2 

=  <9(|lD‘"*dlxxf2.2)  (m-^oc). 

(4.66) 

where 

Now it can be expected that $\|Y^{(m)} - Z^{(m)}\|_{X\times\Omega,2} \to 0$ as $m \to \infty$.

But, since $Z^{(m)}$ has $P_\Omega$-mean 0, $E_{X\times\Omega}\{m^{1/2}(p_m - p^*)(Y^{(m)} - Z^{(m)})\} = E_{X\times\Omega}\{m^{1/2}(p_m - p^*)Y^{(m)}\}$, with $Y^{(m)} = m^{1/2}(p^{(m)} - p_m)$. From a heuristic point of view, (4.65) tells us that the variance of $p^{(m)}(x,\omega)$ can be split into the variance of $p_m$ and the variance of the fluctuations related to the simulation S-step, the latter being of a magnitude $\ge v^*m^{-1}$ as $m \to \infty$. Recalling that $v^* = [J_{obs}\{1 + (J_c/J_{cond})\}]^{-1} \le (2J_{obs})^{-1}$, the conjecture (4.65) would imply, if it were true, that the ARE of the global sequential SEM algorithm is $\le [1 + \{1 + (J_c/J_{cond})\}^{-1}]^{-1}$. If, furthermore, the inequality in (4.65) could be replaced by equality, then the ARE would be $\ge 2/3$ and would converge to 1 as $J_{obs}/J_c$ converges to 1, i.e. as the mixture becomes more and more separable.
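The bounds quoted at the end of Remark 4.3 can be checked with a few lines of arithmetic. The sketch below assumes the reading $v^* = [J_{obs}\{1 + J_c/J_{cond}\}]^{-1}$ with $J_c = J_{obs} + J_{cond}$, and takes the ARE to be the ratio $J_{obs}^{-1}/(v^* + J_{obs}^{-1})$ that would result if (4.65) held with equality. The function names and the numerical information values are hypothetical.

```python
def v_star(j_obs, j_cond):
    """v* = [J_obs {1 + J_c / J_cond}]^{-1}, assuming J_c = J_obs + J_cond."""
    j_c = j_obs + j_cond
    return 1.0 / (j_obs * (1.0 + j_c / j_cond))

def are_if_equality(j_obs, j_cond):
    """ARE if (4.65) held with equality: J_obs^{-1} / (v* + J_obs^{-1})."""
    return (1.0 / j_obs) / (v_star(j_obs, j_cond) + 1.0 / j_obs)

# Smaller J_cond means a more separable mixture (J_obs / J_c closer to 1).
for j_cond in (10.0, 1.0, 0.1, 0.001):
    print(j_cond, round(are_if_equality(1.0, j_cond), 4))
```

The printed values stay above 2/3 and move toward 1 as `j_cond` shrinks, which is the behavior the remark describes; this only illustrates the conjectured formula and does not support the conjecture itself.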

Remark 4.4 As for the one-step sequential SEM algorithm, the global sequential SEM algorithm can be considered as a sequential Bayesian algorithm. Here, the underlying Bayesian algorithm is that of Tanner and Wong (1987).

Remark 4.5 As for Theorem 1, extension of Theorem 3 (i) to the case where the mixture has $K \ge 3$ components is straightforward.

Remark 4.6 As for Theorem 1, extension of Theorem 3 (i) to a general mixture setup has been proved in Celeux and Diebolt (1986b) under the stringent assumption that $T_N(p)$ has only one fixed point in the compact $G_N$ corresponding to $J_N$. This result suggests that a similar result holds in very general incomplete data settings.

APPENDIX


Proof  of  Lemma  2.1 

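Proof of (i): From (2.3) and (2.4),

\[ T_N'(p) = N^{-1}\sum_{i=1}^N f_1(x_i)f_2(x_i)h^{-2}(x_i,p) > 0 \quad \text{for all } p \text{ in } (0,1), \tag{A.1} \]

where $h(x,p) = pf_1(x) + (1-p)f_2(x)$ (see (2.1)).

Proof of (ii): From (2.3) and (2.5),

\begin{align}
L_N'(p) &= \sum_{i=1}^N \{f_1(x_i) - f_2(x_i)\}h^{-1}(x_i,p) \tag{A.2} \\
&= \sum_{i=1}^N \left[p^{-1}t(x_i,p) - (1-p)^{-1}\{1 - t(x_i,p)\}\right] \notag \\
&= p^{-1}(1-p)^{-1}\sum_{i=1}^N \left[(1-p)t(x_i,p) - p\{1 - t(x_i,p)\}\right] \notag \\
&= p^{-1}(1-p)^{-1}\sum_{i=1}^N \{t(x_i,p) - p\}, \notag
\end{align}

hence (2.6).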
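The two identities just proved are easy to confirm on simulated data. The sketch below is only an illustration: it assumes $t(x,p) = pf_1(x)/h(x,p)$ and $T_N(p) = N^{-1}\sum_i t(x_i,p)$ (the reading suggested by the display above), and it uses two unit-variance Gaussian densities as hypothetical stand-ins for the known components $f_1$ and $f_2$.

```python
import math
import random

def normal_pdf(x, mean):
    """Unit-variance Gaussian density (an illustrative choice of known component)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def f1(x): return normal_pdf(x, 0.0)   # hypothetical f_1
def f2(x): return normal_pdf(x, 2.0)   # hypothetical f_2

def h(x, p): return p * f1(x) + (1.0 - p) * f2(x)   # mixture density
def t(x, p): return p * f1(x) / h(x, p)             # posterior weight of component 1

def T_N(xs, p):
    return sum(t(x, p) for x in xs) / len(xs)

def dT_N(xs, p):
    """Right-hand side of (A.1)."""
    return sum(f1(x) * f2(x) / h(x, p) ** 2 for x in xs) / len(xs)

def dL_N(xs, p):
    """Score of the observed-data log-likelihood: sum of (f1 - f2) / h."""
    return sum((f1(x) - f2(x)) / h(x, p) for x in xs)

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) if rng.random() < 0.3 else rng.gauss(2.0, 1.0)
      for _ in range(200)]
p, eps = 0.45, 1e-6

score_direct = dL_N(xs, p)
score_via_t = sum(t(x, p) - p for x in xs) / (p * (1.0 - p))   # last line of the display in (ii)
slope_fd = (T_N(xs, p + eps) - T_N(xs, p - eps)) / (2.0 * eps)  # finite-difference T_N'(p)
print(score_direct - score_via_t, slope_fd - dT_N(xs, p))
```

Both printed differences should be at the level of rounding error, whatever value of p in (0,1) is chosen.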

Proof of (iii): From (A.2),

\[ L_N''(p) = -\sum_{i=1}^N \{f_1(x_i) - f_2(x_i)\}^2h^{-2}(x_i,p) \le 0 \quad \text{for all } p \text{ in } [0,1]. \tag{A.3} \]

Now, either $f_1(x_i) = f_2(x_i)$ for $i = 1, \ldots, N$ or there exists $i$, $1 \le i \le N$, such that $f_1(x_i) \ne f_2(x_i)$. In the first case, we have $L_N'(p) = 0$, $L_N(p)$ is constant and $T_N(p) = p$ on $[0,1]$. In the second case, $L_N''(p) < 0$ for all $p$ in $[0,1]$; thus, $L_N(p)$ is concave on $[0,1]$ and has a unique maximizer on $[0,1]$.

Furthermore, for all $x$ and all $p$ in $(0,1)$,

\begin{align}
\{f_1(x) - f_2(x)\}^2h^{-2}(x,p) &\le 2p^{-2}t^2(x,p) + 2(1-p)^{-2}\{1 - t(x,p)\}^2 \notag \\
&\le 2\{p^{-2} + (1-p)^{-2}\} \quad (\text{since } 0 \le t(x,p) \le 1). \tag{A.4}
\end{align}

Thus, the SLLN implies that for $P_X$-a.e. $x$ in $X$ and all $p$ in $(0,1)$

\[ N^{-1}L_N''(p) \to L''(p) = -\int \{f_1(x) - f_2(x)\}^2h^{-2}(x,p)h(x,p^*)\mu(dx) \tag{A.5} \]

as $N \to \infty$.

By the assumptions of Theorem 1, $L''(p) < 0$. Hence (iii).

Proof of (iv): From (2.6),

\[ T_N'(p) - 1 = (1-2p)N^{-1}L_N'(p) + p(1-p)N^{-1}L_N''(p) \quad \text{for all } p \text{ in } (0,1). \tag{A.6} \]

Thus, for $p = p_N$, where $L_N'(p_N) = 0$ and $L_N''(p_N) < 0$, we have

\[ T_N'(p_N) = 1 + p_N(1-p_N)N^{-1}L_N''(p_N) < 1. \tag{A.7} \]

As $T_N'(p_N) > 0$ and $T_N(p_N) = p_N$ again by (2.6), (2.7) is proved. Furthermore, since $L_N'(p) > 0$ for all $0 < p < p_N$ and $L_N'(p) < 0$ for all $p_N < p < 1$, the remainder of (iv) obtains again from (2.6). (Compare the proof in Silverman (1980).)

Proof  of  (v)  :  Assertion  (v)  is  a  direct  consequence  of  (iv)  and  its  proof 
will  be  omitted  here. 

Proof of (vi): Remark that the empirical complete-data information value is $J_{N,c} = p_N^{-1}(1-p_N)^{-1}$ and the empirical observed-data information value is $J_{N,obs} = -N^{-1}L_N''(p_N)$ (e.g., Titterington et al. (1985)). Thus, from (A.7),

\[ r_N = T_N'(p_N) = 1 - J_{N,c}^{-1}J_{N,obs}. \tag{A.8} \]

Since $J_{N,obs} \to J_{obs}$ and $J_{N,c} \to J_c$ as $N \to \infty$ for $P_X$-a.e. $x$ and $J_c = J_{obs} + J_{cond}$, (vi) obtains.

It is worth noting that (A.8) also results from a general relation in Dempster et al. (1977) and that the information ratio $J_c^{-1}J_{obs}$ measures “the proportion of information about p without knowing the subpopulation membership [. . .] (and) might be interpreted as the ability of the data to distinguish the component densities” (Windham and Cutler (1991)). Indeed, it is well known (e.g., Louis (1982) and Sundberg (1976)) that the convergence rate of the EM algorithm is the largest eigenvalue of the matrix $I - J_c^{-1}J_{obs}$, which is consistent with (vi) above. □
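Relation (A.8) can be illustrated numerically. In the sketch below, $p_N$ is approximated by iterating the EM map $p \mapsto T_N(p)$, and the slope $T_N'(p_N)$ from (A.1) is compared with $1 - J_{N,c}^{-1}J_{N,obs}$. The Gaussian components, the sample size and the helper names are assumptions made for the example only.

```python
import math
import random

def pdf(x, mean):
    """Unit-variance Gaussian density, a hypothetical stand-in for a known component."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def em_fixed_point(a, b, p=0.5, iters=2000):
    """Iterate p <- T_N(p), where a[i] = f1(x_i) and b[i] = f2(x_i)."""
    n = len(a)
    for _ in range(iters):
        p = sum(p * ai / (p * ai + (1.0 - p) * bi) for ai, bi in zip(a, b)) / n
    return p

def rate_and_information(a, b, p):
    """Return T_N'(p), the empirical observed information and the complete-data information."""
    n = len(a)
    rate = sum(ai * bi / (p * ai + (1.0 - p) * bi) ** 2 for ai, bi in zip(a, b)) / n
    j_obs = sum(((ai - bi) / (p * ai + (1.0 - p) * bi)) ** 2 for ai, bi in zip(a, b)) / n
    j_c = 1.0 / (p * (1.0 - p))
    return rate, j_obs, j_c

rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) if rng.random() < 0.3 else rng.gauss(2.0, 1.0)
      for _ in range(400)]
a = [pdf(x, 0.0) for x in xs]
b = [pdf(x, 2.0) for x in xs]

p_N = em_fixed_point(a, b)
r_N, J_obs_N, J_c_N = rate_and_information(a, b, p_N)
print(p_N, r_N, 1.0 - J_obs_N / J_c_N)
```

The last two printed numbers should coincide up to rounding, since the identity only requires $L_N'(p_N) = 0$, which holds at an interior fixed point of $T_N$.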

Lemma A.1 For $P_X$-a.e. $x$, all $N$ and all $p$ in $(0,1)$,

\[ 0 < T_N'(p) \le (1/2)p^{-1}(1-p)^{-1} \tag{A.9} \]

and

\[ |T_N''(p)| \le p^{-2}(1-p)^{-2}. \tag{A.10} \]

Proof of Lemma A.1. Inequality (A.9) results from (A.1) and the elementary inequality

\[ 2p(1-p)f_1(x)f_2(x) \le \{pf_1(x) + (1-p)f_2(x)\}^2. \tag{A.11} \]

Similarly,

\[ T_N''(p) = -2N^{-1}\sum_{i=1}^N f_1(x_i)f_2(x_i)\{f_1(x_i) - f_2(x_i)\}h^{-3}(x_i,p) \]

and

\[ 2p(1-p)f_1(x)f_2(x)h^{-2}(x,p) \le 1, \]

whereas

\[ |f_1(x) - f_2(x)|h^{-1}(x,p) \le f_1(x)h^{-1}(x,p) + f_2(x)h^{-1}(x,p) = p^{-1}t(x,p) + (1-p)^{-1}\{1 - t(x,p)\} \le p^{-1} + (1-p)^{-1} = p^{-1}(1-p)^{-1}. \]

□

Lemma A.2 For $P_X$-a.e. $x$,

\[ r_N' = \inf_{p\in(0,1)} T_N'(p) \to r' = \inf_{p\in(0,1)} T'(p) > 0 \quad \text{as } N \to \infty, \tag{A.12} \]

where $T'(p) = \int f_1(x)f_2(x)h^{-2}(x,p)h(x,p^*)\mu(dx) = \lim_{N\to\infty}T_N'(p)$, $0 < p < 1$.

Proof of Lemma A.2. It is an easy convexity fact that $r_N' = T_N'(p_{inf,N})$ and $r' = T'(p_{inf})$, where $p_{inf,N}$ and $p_{inf}$ denote the inflexion points of $T_N(p)$ and $T(p)$, respectively, i.e. $T_N''(p_{inf,N}) = T''(p_{inf}) = 0$. Furthermore, it can be shown that $r_N'$ and $r'$ are in $(0,1)$. Let $b$, $0 < b < 1/2$, be so small that $T'(b)$ and $T'(1-b)$ are $> 1$. Since $T_N''(p)$ and $T''(p)$, $0 < p < 1$, are increasing functions, Dini's lemma implies that $\|T_N'' - T''\|_I = \sup_{p\in I}|T_N''(p) - T''(p)| \to 0$ as $N \to \infty$, where $I = [b, 1-b]$. Also, $T_N'(b)$ and $T_N'(1-b)$ are $> 1$ for all $N$ large enough, and $p_{inf,N}$ and $p_{inf}$ are in $I$.

Finally, $|T_N''(p_{inf,N}) - T''(p_{inf,N})| = |T''(p_{inf,N})| \le \|T_N'' - T''\|_I \to 0$ as $N \to \infty$. Thus, $T''(p_{inf,N}) \to 0$ as $N \to \infty$, which implies that $p_{inf,N} \to p_{inf}$ as $N \to \infty$ and $r_N' = T_N'(p_{inf,N}) \to r' = T'(p_{inf})$ as $N \to \infty$. □

Lemma A.3 For $P_X$-a.e. $x$ and all $N$ large enough,

\[ |S_N'(p)| \le (1/2)p^{-1}(1-p)^{-1} + \mathrm{cst.}\,p^{-3/2}(1-p)^{-3/2}, \quad 0 < p < 1, \tag{A.13} \]

where cst. denotes some positive constant.

Proof of Lemma A.3 Since

\[ S_N'(p) = (1/2)p^{-1/2}(1-p)^{-1/2}(1-2p)\{T_N'(p)\}^{1/2} + (1/2)p^{1/2}(1-p)^{1/2}\{T_N'(p)\}^{-1/2}T_N''(p), \tag{A.14} \]

we have for all $p$ in $(0,1)$ that

\[ |S_N'(p)| \le (1/2)p^{-1}(1-p)^{-1} + (1/2)(r_N')^{-1/2}p^{-3/2}(1-p)^{-3/2}, \tag{A.15} \]

in view of (A.9) and (A.10). □

Lemma A.4 (i) We have

\[ |r_N - r^*| = O(\ell(N)N^{-1/2}) \quad (N \to \infty). \tag{A.16} \]

(ii) We have

\[ |\sigma_N - \sigma^*| = O(\ell(N)N^{-1/2}) \quad (N \to \infty). \tag{A.17} \]

Proof of Lemma A.4 Proof of (i): From the general theory of ML estimation, $|p_N - p^*| = O(\ell(N)N^{-1/2})$ $(N \to \infty)$ $P_X$-a.s. Thus, for $N$ large enough, we have in view of Lemma A.1 that

\begin{align}
|r_N - r^*| &\le |T_N'(p_N) - T_N'(p^*)| + |T_N'(p^*) - T'(p^*)| \notag \\
&\le O(\ell(N)N^{-1/2}) + |T_N'(p^*) - T'(p^*)|. \tag{A.18}
\end{align}

But $T_N'(p^*) - T'(p^*)$ has the form $N^{-1}\sum_{i=1}^N U_i$, where the r.v.'s $U_i$ are i.i.d., bounded and have mean 0. Thus, by the LIL, $|T_N'(p^*) - T'(p^*)| = O(\ell(N)N^{-1/2})$ $(N \to \infty)$.

The proof of (ii) proceeds similarly. □

Proof of Lemma 2.2 By the duality principle of Diebolt and Robert (1992), it is enough to prove that the sequence $\{z^{(m)}\}$ is ergodic. Since it is a finite-state homogeneous Markov chain, it is enough to prove that all the transition probabilities are positive. Now, define

\[ A_N = \Big\{ z \in Z : N^{-1}\sum_{i=1}^N z_i \in J_N \Big\}, \tag{A.19} \]

where $Z = \{0,1\}^N$ has $2^N$ elements. If $a$ and $b$ are any elements of $A_N$,

=  a}  = 

Pq{z^”*+^^  =  blz^"*'*’^^/^^^  €  An, 2^*"^  =  a}  G  AnIz^"*^  =  a) 

-|-Pn{z^’"'*'^^  =  blz^’"'*'^^/^^)  ^  An,z^”*^  =  a)  ^  AnIz^"*^  =  a}, 

(A.20) 

with 

Pjj{z(-+(»/2))  g  AnIz^'”)  =  a)  >  0  (A.21) 

and 

Pj^{z(-+(i/2))  ^  An|2<’">  =  a)  >  0,  (A.22) 

since  all  the  states  z  €  Z  can  be  reached  from  z^*"^  with  positive  probabil¬ 
ity  (because  t{x,p)  is  in  (0,1)  for  all  p  in  (0,1)).  Moreover,  Pn{z^'”'^*)  = 
.|2(">+(i/2))  ^  An}  is  a  given  probability  distribution  related  to  Fn,  whereas 


Pn{z<'"+*)  =  g  An,z^”*>  =  a}  =  Pn{z<’"+l*/2))  _  b|zl”*)  =  a} 

N  ^ 

=  ■{  1  -  I  >  0. 

i=l  i=i  j=l  J 

(A.23) 


This  completes  the  proof  of  Lemma  2.2. 


□ 
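To make the finite-state chain of this proof concrete, here is a minimal simulation sketch of one plausible reading of the SEM iteration for a mixing proportion: the S-step draws labels $z_i$ from Bernoulli($t(x_i, p^{(m)})$) and the update is their mean. The interval `[lo, hi]` standing in for $J_N$, the redraw rule used when the mean of the labels leaves it, and the Gaussian components are all assumptions of this example, not the paper's specification.

```python
import math
import random

def pdf(x, mean):
    """Unit-variance Gaussian density, a hypothetical stand-in for a known component."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def sem_chain(a, b, p0, n_iter, rng, lo=0.02, hi=0.98):
    """Simulate p^(m); a[i] = f1(x_i), b[i] = f2(x_i); [lo, hi] is a hypothetical J_N."""
    n, p, path = len(a), p0, []
    for _ in range(n_iter):
        # S-step: z_i ~ Bernoulli(t(x_i, p)) with t = p f1 / (p f1 + (1 - p) f2).
        z = [1 if rng.random() < p * ai / (p * ai + (1.0 - p) * bi) else 0
             for ai, bi in zip(a, b)]
        p_new = sum(z) / n
        while not (lo <= p_new <= hi):
            # Assumed fallback: redraw the labels from a fixed distribution until z is in A_N.
            p_new = sum(1 if rng.random() < 0.5 else 0 for _ in range(n)) / n
        p = p_new  # update: the proportion of simulated labels equal to 1
        path.append(p)
    return path

def em_estimate(a, b, p=0.5, iters=500):
    """Approximate the ML estimate by iterating the EM map p <- T_N(p)."""
    for _ in range(iters):
        p = sum(p * ai / (p * ai + (1.0 - p) * bi) for ai, bi in zip(a, b)) / len(a)
    return p

rng = random.Random(2)
xs = [rng.gauss(0.0, 1.0) if rng.random() < 0.4 else rng.gauss(3.0, 1.0)
      for _ in range(300)]
a = [pdf(x, 0.0) for x in xs]
b = [pdf(x, 3.0) for x in xs]

path = sem_chain(a, b, p0=0.5, n_iter=400, rng=rng)
chain_mean = sum(path[100:]) / len(path[100:])  # discard a short burn-in
p_N = em_estimate(a, b)
print(chain_mean, p_N)
```

In runs of this kind the chain fluctuates around the ML estimate, which is the qualitative behavior the ergodicity result describes; the sketch makes no claim about the exact stationary law.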


Proof of Lemma 3.1 The proof is very similar to that of Lemma 2.2, and is thus omitted. □


Proof of Lemma 3.3 We prove (3.18). By the quadratic Taylor formula, we have, if $|h| \le 2\varepsilon_0$ and $N$ is large enough, that

\[ |T_N(p_N + h) - T_N(p_N) - hr_N| \le (1/2)h^2\sup_{0<\theta<1}|T_N''(p_N + \theta h)| \le a_0h^2, \tag{A.24} \]

for some constant $a_0$, in view of (3.17). If $|h| > 2\varepsilon_0$, then, since

\[ |T_N(p_N + h) - T_N(p_N) - hr_N| \le 3, \tag{A.25} \]

it is enough to choose $A_0 \ge a_0$ such that $A_0(2\varepsilon_0)^2 \ge 3$. □

Proof  of  Lemma  3.4  It  is  similar  to  that  of  Lemma  3.3  except  that  we 
make  use  of  a  linear  Taylor  expansion  rather  than  a  quadratic  one.  □ 

Lemma A.5 We have

\[ |p_{m+1} - p_m| \le m^{-1}J_{obs}^{-1}M(p^*)(1 + o(1)) \quad (m \to \infty). \tag{A.26} \]

Proof of Lemma A.5 By definition, $L_m'(p_m) = L_{m+1}'(p_{m+1}) = 0$. By a linear Taylor expansion of $L_m'(p)$,

\begin{align}
L_m''(p_m)(p_{m+1} - p_m) + O(|p_{m+1} - p_m|^2) &= -(\partial/\partial p)\log h(x_{m+1},p)\big|_{p=p_{m+1}} \notag \\
&= -(f_1 - f_2)(x_{m+1})h^{-1}(x_{m+1},p_{m+1}). \tag{A.27}
\end{align}

Since

\[ m^{-1}L_m''(p_m) \to -J_{obs} \]

and

$|f_1 - f_2|(x)h^{-1}(x,p) \le p^{-1}(1-p)^{-1} = M(p)$ for all $x$ and $p$ in $(0,1)$, the proof is complete.

Proof of Lemma 4.1 Assertion (i) is straightforward since under (H6)
^(m)  =  0(m“*’)  for  some  positive  6.  Assertion  (ii)  results  from  the  following 
inequality: 

\[ \varepsilon(m) = 1 - R(m) \ge \mathrm{cst.}\,c(m) \quad (m \to \infty). \tag{A.28} \]

We now turn to (iii). Since $1 - R(m) = \varepsilon(m) \ge \mathrm{cst.}\,m^{-\mu}$ for some positive cst.,

\begin{align*}
\log\Big\{\prod_{k=1}^m R(k)\Big\} + (3/2)\log m &= \sum_{k=1}^m \log(1 - \varepsilon(k)) + (3/2)\log m \\
&\le -\sum_{k=1}^m \varepsilon(k) + (3/2)\log m \\
&\le -\mathrm{cst.}\,(1-\mu)^{-1}m^{1-\mu} + (3/2)\log m \\
&\to -\infty \quad \text{as } m \to \infty.
\end{align*}

The proof of (4.13) is similar.


□

The purpose of this paper is to study the asymptotic behavior of the Stochastic EM algorithm (SEM) in a simple particular case within the mixture context. We consider the estimation of the mixing proportion p of a two-component mixture of densities assumed to be known. We establish that the stationary distribution of the ergodic Markov chain generated by SEM is asymptotic, as the sample size N tends to infinity, to a Gaussian distribution with mean the consistent maximum likelihood estimate of p and variance proportional to N-1/2. Similarly, we determine the limiting distributions of two sequential versions of SEM and study their asymptotic relative efficiency.


